Abstract


The performance of traditional RAID Level 5 arrays is, for many applications, unacceptably poor
while one of its constituent disks is non-functional. This paper describes and evaluates
mechanisms by which this disk array failure-recovery performance can be improved. The two key
issues addressed are the data layout, the mapping by which data and parity blocks are assigned to
physical disk blocks in an array, and the reconstruction algorithm, which is the technique used to
recover data that is lost when a component disk fails.

The data layout techniques this paper investigates are instantiations of the declustered parity
organization, a derivative of RAID Level 5 that allows a system to trade some of its data capacity
for improved failure-recovery performance. We show that our instantiations of parity declustering
improve the failure-mode performance of an array significantly, and that a parity-declustered
architecture is preferable to an equivalent-size multiple-group RAID Level 5 organization in
environments where failure-recovery performance is important. The presented analyses also include
comparisons to a RAID Level 1 (mirrored disks) approach.

With respect to reconstruction algorithms, this paper describes and briefly evaluates two
alternatives, stripe-oriented reconstruction and disk-oriented reconstruction, and establishes that
the latter is preferable as it provides faster reconstruction. The paper then revisits a set of
previously-proposed reconstruction optimizations, evaluating their efficacy when used in
conjunction with the disk-oriented algorithm. The paper concludes with a section on the
reliability versus capacity trade-off that must be addressed when designing large arrays.


1. Introduction

The performance of a storage subsystem during its recovery from a disk failure is crucial to
applications such as on-line transaction processing (OLTP) that mandate both high I/O performance
and high data reliability. Such systems demand not only the ability to recover from a disk
failure without losing data, but also that the recovery process (1) function without taking the
system off-line, (2) rapidly restore the system to its fault-free state, and (3) have minimal
impact on system performance as observed by users. Condition (2) ensures that the system's
vulnerability to data loss is minimal, while conditions (1) and (3) provide for on-line failure
recovery. A good example is an airline reservation system, where inadequate recovery from a disk
crash can cause an interruption in the availability of booking information and thus lead to flight
delays and/or revenue loss. Furthermore, because fault-tolerant storage systems exhibit degraded
performance while recovering from the failure of a component disk, the fault-free system load must
be kept light enough for performance during recovery to be acceptable. For this reason, a decrease
in performance degradation during failure recovery can translate directly into improved fault-free
performance. With this in mind, the twin goals of the techniques discussed in this paper are to
minimize the time taken to recover the content of a failed disk onto a replacement, that is, to
restore the system to the fault-free state, and to simultaneously minimize the impact of failure
recovery on the performance of the array (throughput and response time) as observed by users.

Fault-tolerance in a data storage subsystem is generally achieved either by disk mirroring
[Katzman77, Bitton88, Copeland89, Hsiao91], or by parity encoding [Arulpragasam80, Gibson93,
Kim86, Park86, Patterson88]. In the former, one or more duplicate copies of each user data
unit are stored on separate disks. In the latter, commonly known as Redundant Arrays of
Inexpensive¹ Disks (RAID) Levels 3, 4, and 5 [Patterson88], a small portion of the array's physical
capacity is used to store an error correcting code computed over the data stored in the array. The
additional storage required for redundancy can be as large as 25% of the capacity of the array, but
is often much smaller. Studies [Chen90a, Gray90] have shown that, due to superior performance
on small read and write operations, a mirrored array, also known as RAID Level 1, can deliver
higher performance to OLTP workloads than can a parity-based array. Unfortunately, mirroring is
substantially more expensive: its storage overhead for redundancy is 100%; that is, four or
more times larger than that of typical parity encoded arrays. Furthermore, recent studies
[Stodolsky93, Menon92a, Rosenblum91] have demonstrated techniques that allow the small-write
performance of parity-based arrays to approach that of mirroring. This paper, therefore, focuses on
parity-based arrays, but includes comparisons to mirroring where meaningful.

1. Because of industrial interest in using the RAID acronym and because of their concerns about
the restrictiveness of its "Inexpensive" component, RAID is sometimes reported as an acronym for
Redundant Arrays of Independent Disks [RAID93].

We do not recommend on-line failure recovery for applications that can tolerate off-line
recovery, since the latter restores high performance and high data reliability more quickly. For
this reason, we focus the discussion and analysis in this paper around OLTP applications; these
clearly benefit from the failure-mode performance improvements that are the primary topic of this
paper. Application areas with very different workload characteristics, multimedia for example, can
also benefit from improved reliability and availability in the storage subsystem. We defer the
evaluation of the proposed techniques under such applications to future work.

Section 2 of this paper provides background on redundant disk arrays. Section 3 introduces
parity declustering and Section 4 describes data layout schemes for implementing parity
declustering. Section 5 describes the performance evaluation environment. Section 6 describes
alternative reconstruction algorithms, techniques used to recover data lost when a disk fails.
Section 7 then presents performance evaluation. The first part of this section compares the
performance of a declustered-parity array to that of an equivalent-sized multiple-group RAID
Level 5 array, and the second part investigates the trade-off between disk capacity overhead and
failure-recovery performance in a declustered-parity array. Section 8 describes and evaluates a set
of modifications that can be applied to the reconstruction algorithm. Section 9 discusses
techniques for selecting a system configuration based on the requirements of the environment and
application. Section 10 summarizes the contributions of this paper and outlines interesting issues
for future work.

2. Redundant disk arrays

Patterson, Gibson, and Katz [Patterson88] present a taxonomy of redundant disk array
architectures, RAID Levels 1 through 5. Of these, RAID Level 3 is best at providing large amounts
of data to a single requestor with high bandwidth, while RAID Levels 1 and 5 are most appropriate
for highly concurrent access to shared files. The latter are preferable for OLTP-class
applications, since OLTP is often characterized by a large number of independent processes
concurrently requesting access to relatively small units of data [TPCA89]. For this reason and
because of the relatively high cost of redundancy in RAID Level 1 arrays, this paper focuses on
architectures derived from the RAID Level 5 organization.

Figure 1: Disk array architectures.

Figure 1a and Figure 1b illustrate two possible disk array subsystem architectures. Today's
systems use the architecture of Figure 1a, in which the disks are connected via inexpensive,
low-bandwidth (e.g. SCSI [ANSI86]) links to an array controller, which is connected via one or more
high-bandwidth parallel buses (e.g. HIPPI [ANSI91]) to one or more host computers. Array
controllers and disk buses are often duplicated (indicated by the dotted lines in Figure 1) so that
they do not represent a single point of failure [Menon93]. The controller functionality can also be
distributed amongst the disks of the array [Cao93].

As disks get smaller [Gibson92], the large cables used by SCSI and other bus interfaces
become increasingly unattractive. The system sketched in Figure 1b offers an alternative. It uses
high-bandwidth serial links for disk interconnection. This architecture scales to large arrays more
easily because it eliminates the need for the array controller to incorporate a large number of
string controllers. While serial-interface disks are not yet common, standards for them are
emerging (P1394 [IEEE93], Fibre Channel [Fibre91], DQDB [IEEE89]). As the cost of high-bandwidth
serial connectivity is reduced, architectures similar to that of Figure 1b may supplant today's
short, parallel bus-based arrays.

Figure 2: Data layout in a 5-disk array employing the left-symmetric RAID Level 5 organization.

In both organizations, the array controller is responsible for all system-related activity:
controlling individual disks, maintaining redundant information, executing requested transfers, and
recovering from disk or link failures. The functionality of an array controller can also be
implemented in software executing on the subsystem's host or hosts. The algorithms and analyses
presented in this paper apply to all array controller implementations.

Figure 2 shows an arrangement of data and parity on the disks of an array using the
"left-symmetric" variant of the RAID Level 5 architecture [Chen90b, Lee91]. Logically contiguous
user data is broken down into blocks and striped across the disks to allow for concurrent access by
independent processes [Livny87]. The shaded blocks, labelled Pi, store the parity (cumulative
exclusive-or) computed over corresponding data blocks, labelled Di.0 through Di.3. An individual
block is called a data unit if it contains user data, a parity unit if it contains parity, and
simply a unit when the data/parity distinction is not pertinent. A set of data units and their
corresponding parity unit is referred to as a parity stripe.
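
The placement rule behind a left-symmetric layout can be sketched in a few lines. This is an illustration only: it assumes the conventional definition of left-symmetric placement (parity starts on the last disk and moves one disk to the left per stripe, with data units beginning on the disk just after the parity unit and wrapping around), and the function name `left_symmetric` is hypothetical rather than taken from the paper.

```python
def left_symmetric(unit, num_disks):
    """Map a logical data unit number to (stripe, data_disk, parity_disk).

    Assumed rule: parity for stripe i sits on disk (num_disks - 1 - i) mod num_disks,
    and the stripe's data units follow it, wrapping around the array.
    """
    data_per_stripe = num_disks - 1
    stripe, offset = divmod(unit, data_per_stripe)
    # Parity starts on the last disk and shifts one disk left per stripe.
    parity_disk = (num_disks - 1 - stripe) % num_disks
    # Data units begin on the disk just after the parity unit.
    data_disk = (parity_disk + 1 + offset) % num_disks
    return stripe, data_disk, parity_disk


if __name__ == "__main__":
    for unit in range(10):
        print(unit, left_symmetric(unit, 5))
```

Under this assumed rule, consecutive data units land on consecutive disks (data unit n falls on disk n mod C), so a large contiguous read touches every disk in the array.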

Since every update to a data unit implies that a parity unit must also be updated, small write
operations require four disk operations: pre-read and write of the data to compute which bits in
the data unit have been toggled, followed by a pre-read and write of the parity unit to toggle the
corresponding bits. To avoid contention for a single parity disk, the assignment of parity blocks
to disks rotates across the array. As Section 6.2 discusses, the unit of data striping, the unit of
parity rotation, and the unit of reconstruction access need not all be the same. In particular, the
unit of data striping should be determined by the array's expected workload [Chen90b].
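
The four-access small write described above can be sketched as follows. This is a minimal in-memory model, not the paper's implementation: disks are assumed to be dictionaries mapping an offset to a `bytes` unit, and the name `small_write` is hypothetical.

```python
def small_write(disks, data_disk, parity_disk, offset, new_data):
    """Read-modify-write of one data unit; returns the number of disk accesses used."""
    accesses = 0

    old_data = disks[data_disk][offset]          # pre-read of the data unit
    accesses += 1
    disks[data_disk][offset] = new_data          # write of the new data
    accesses += 1

    # Bits that differ between the old and new data are the bits to toggle in parity.
    toggled = bytes(a ^ b for a, b in zip(old_data, new_data))

    old_parity = disks[parity_disk][offset]      # pre-read of the parity unit
    accesses += 1
    disks[parity_disk][offset] = bytes(p ^ t for p, t in zip(old_parity, toggled))
    accesses += 1                                # write of the updated parity

    return accesses
```

Only the target data disk and the parity disk are touched, which is why the cost is four accesses regardless of how many disks the stripe spans.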

Because disk failures are detectable [Patterson88, Gibson93], arrays of disks constitute an
erasure channel [Peterson72], and so a parity code can correct any single disk failure. To see
this, assume that disk number two has failed and simply note that

$(P_i = D_{i.0} \oplus D_{i.1} \oplus D_{i.2} \oplus D_{i.3}) \Rightarrow (D_{i.2} = D_{i.0} \oplus D_{i.1} \oplus P_i \oplus D_{i.3})$.
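
The identity above can be exercised directly. The sketch below is illustrative: the helper names `xor_units` and `reconstruct` are hypothetical, and units are modeled as equal-length `bytes` values.

```python
from functools import reduce


def xor_units(units):
    """Byte-wise exclusive-or over a list of equal-length units."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*units))


def reconstruct(surviving_units):
    """Rebuild the one missing unit of a parity stripe from all surviving units.

    The surviving units are the remaining data units plus the parity unit.
    """
    return xor_units(surviving_units)


if __name__ == "__main__":
    data = [b"\x0f\xa0", b"\x33\x55", b"\xc3\x3c", b"\x01\x02"]
    parity = xor_units(data)
    # Pretend disk number two has failed: drop its unit and rebuild it.
    survivors = data[:2] + data[3:] + [parity]
    print(reconstruct(survivors) == data[2])
```

The same routine serves both background reconstruction and on-the-fly servicing of a user read aimed at the failed disk, since both reduce to XOR over the surviving units of one stripe.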

An array containing a failed disk can be restored to its fault-free state by successively
reconstructing each block of the failed disk and storing it on a replacement drive. This is
generally performed by a background process in either the host or the array controller. Note that
an array need not be taken off-line to implement the reconstruction of a failed disk, because
reconstruction accesses can be interleaved with user accesses to data on non-failed disks, and
because user accesses to data on the failed disk can be serviced "on-the-fly" by immediate
reconstruction of the indicated unit(s). Once reconstruction is complete, the array can again
tolerate the loss of any single disk, and so is again fault-free, albeit with a diminished number
of on-line spare disks until the faulty drives can be physically replaced. Gibson and Patterson
[Gibson93] show that a small number of spare disks suffice to provide a high degree of protection
against data loss in relatively large arrays (>70 disks). Although the above organization can be
easily extended to tolerate multiple disk failures, this paper focuses on single-failure
toleration.

3. Parity declustering²

The RAID Level 5 organization presents two problems for continuous-operation systems like
OLTP. First, the load increase experienced by surviving drives in the presence of a disk failure is
severe. Specifically, each user read operation that requests data from the failed drive invokes a
read operation on every other disk in the group, and so the read load increase in the presence of
failure is 100%. Similarly, a user write operation to a failed data unit must invoke a read on every
other drive in order to be able to compute the new parity for the targeted parity stripe. This
changes the four accesses normally needed to perform the write into one access per surviving
drive, and hence the write load increase in the presence of failure is 25%.³ The easiest way to
understand this is to consider a hypothetical user workload that sends r read requests and w write
requests to each disk in the array. In fault-free mode, each user write request translates into four
accesses, and so each disk sees a total workload of r+4w accesses. In the presence of disk failure,
this load increases to 2r+5w accesses, indicating that read workload has doubled and write
workload has increased by 25%. For a workload emphasizing small accesses and consisting of 80%
reads on a 40-disk array, this evaluates to an overall load increase of about 60%.

2. Parity declustering is also known as Clustered RAID. We prefer the former term as it follows the
usage in earlier work on mirrored arrays [Teradata85, Livny87, Copeland89] where user data and
redundancy information are "declustered" over more than the minimal collection of disks.
3. The write-load increase is not in fact 25% because when a user writes data for which the
corresponding parity has failed, no parity update is performed. This means that some accesses in
degraded mode do less work than they would in fault-free mode [Ng92]. This effect is inversely
proportional to the size of the array (C), and is small for the array sizes we consider in this
paper, and so we neglect it.
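
The load arithmetic for the RAID Level 5 case can be checked with a short calculation. The sketch below follows the simple model in the text (r+4w accesses per disk when fault-free, 2r+5w with a failed disk) and ignores the small correction for writes whose parity unit has failed; the function names are hypothetical.

```python
def per_disk_load(reads, writes, degraded=False):
    """Accesses seen by each disk for a per-disk user workload of reads and writes."""
    if degraded:
        return 2 * reads + 5 * writes   # read load doubles, write load grows by 25%
    return reads + 4 * writes           # each small write costs four accesses


def load_increase(read_fraction):
    """Fractional increase in per-disk load caused by a single disk failure."""
    write_fraction = 1.0 - read_fraction
    fault_free = per_disk_load(read_fraction, write_fraction)
    degraded = per_disk_load(read_fraction, write_fraction, degraded=True)
    return degraded / fault_free - 1.0


if __name__ == "__main__":
    print(f"{load_increase(0.8):.1%}")
```

With 80% reads this simple model gives (2(0.8) + 5(0.2)) / (0.8 + 4(0.2)) = 2.6 / 1.6, an increase of 62.5%, which is in line with the "about 60%" figure quoted for the 40-disk example.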

If a spare disk is available for a reconstruction process to rebuild lost data onto, then
surviving disks must also bear this additional load. This load increase experienced by the
surviving disks during reconstruction necessitates that each disk's fault-free load be light enough
that the surviving disks will not saturate when a failure occurs. Disk saturation is in general
unacceptable because most applications mandate a minimum level of responsiveness; the TPC-A
benchmark [TPCA89], for example, requires that 90% of all transactions complete in under two
seconds. Long queueing delays caused by disk saturation can violate these requirements.

The second problem with RAID Level 5 arrays is that at moderate to high user workloads,
they require a relatively long period of time to recover from a failure: that is, to reconstruct
the entire contents of a failed drive and store it on a replacement. This is because the load
increase associated with the failure can cause even a moderately loaded array to approach
saturation. When this occurs, little disk bandwidth is available for reconstruction, and so the
process of recovering the data takes a long time. During this period of time the array is both
operating at reduced performance and vulnerable to data loss due to a second failure, and so it is
essential that the reconstruction period be minimized.

The declustered parity [Muntz90, Holland92, Merchant92, Ng92] disk array organization
addresses these problems. For a given number of disks, C, a declustered parity organization
allows the failure-induced load increase on the surviving disks to be reduced by any integral
factor between 2 and C-1, inclusive. This is achieved by increasing the amount of redundant
information stored in the array, and so it can be thought of as trading some of an array's data
capacity for improved performance in the presence of disk failure.


Referring again to Figure 2, note that each parity unit protects C-1 data units, where C is the
number of disks in the array. If instead the array were organized such that each parity unit
protected some smaller number of data units, say G-1, then more of the array's capacity would be
consumed by parity, but the reconstruction of a single data unit would require that the host or
controller read only G-1 units instead of C-1. As illustrated in Figure 3, parity declustering can
also be viewed as the distribution of the parity stripes comprising a logical RAID Level 5 array on
G disks over a set of C physical disks. The advantage of this rearrangement is that not every
surviving disk is involved in the reconstruction of a particular data unit; C-G disks are left free
to do other work. Thus each surviving disk's degraded-mode load is multiplied by a factor
(G-1)/(C-1), relative to RAID Level 5. The fraction (G-1)/(C-1) is referred to as the declustering
ratio, and is denoted by α. More specifically, parity declustering reduces the degraded-mode
workload increase due to user reads from a factor of 2.0 to a factor of 1+α, and the workload
increase due to writes from a factor of 1.25 to a factor of 1+0.25α.

Figure 3: Declustering a parity stripe of size four over an array of seven disks.
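
As a worked instance of these factors, the sketch below computes the declustering ratio and the resulting degraded-mode load multipliers. The function names are hypothetical; the formulas are the ones stated in the text.

```python
def declustering_ratio(G, C):
    """alpha = (G - 1) / (C - 1) for parity stripes of G units on C disks."""
    if not 2 <= G <= C:
        raise ValueError("G must satisfy 2 <= G <= C")
    return (G - 1) / (C - 1)


def degraded_load_factors(G, C):
    """Return (read_factor, write_factor) for per-disk load after one disk failure."""
    alpha = declustering_ratio(G, C)
    return 1 + alpha, 1 + 0.25 * alpha


if __name__ == "__main__":
    # The Figure 3 configuration: stripes of size four over seven disks.
    print(declustering_ratio(4, 7), degraded_load_factors(4, 7))
```

For the configuration of Figure 3 (G = 4, C = 7) this gives α = 0.5, so user-read load rises by a factor of 1.5 rather than 2.0 and write load by 1.125 rather than 1.25.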

The declustering ratio can be made smaller either by increasing C for a fixed G as shown in
Figure 3, or by decreasing G for a fixed C. As α is made smaller, performance during failure
recovery improves since the load increase on each surviving disk diminishes, but more of the
array's capacity is consumed by parity. Many of the performance plots in subsequent sections are
presented with α on the x-axis.

When G = 2 (the minimum allowable value), declustered parity reduces to mirroring, since
the parity unit for each parity stripe is computed as the XOR over only one data unit. Note,
however, that since the array consists of a large number of parity stripes, the mirror copy of each
disk is distributed over the array rather than being localized to one disk. Thus parity
declustering with G = 2 is essentially the same as interleaved declustering (a technique for
distributing the backup copies in arrays of mirrored disks [Teradata85, Copeland89, Hsiao91]), the
only difference being in the mechanism used to select the disks upon which the backup copy of each
data unit resides. At the other extreme, G = C (α = 1.0), parity declustering is equivalent to RAID
Level 5. Thus parity declustering can be seen as defining a continuum of design points between
RAID Level 5 and mirroring, with the capacity overhead being increased and the failure-mode
performance being improved as G is reduced.
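
The continuum between mirroring and RAID Level 5 can be tabulated for a fixed array size. In the sketch below, the parity fraction 1/G is not quoted from the paper; it is derived from the definition of a parity stripe as G units of which one is parity. The function name `design_points` is hypothetical.

```python
from fractions import Fraction


def design_points(C):
    """List (G, alpha, parity_fraction) for every stripe size G from 2 to C.

    alpha is the declustering ratio (G - 1) / (C - 1); parity_fraction is the
    share of raw capacity holding parity, taken as 1 / G (one unit per stripe).
    """
    points = []
    for G in range(2, C + 1):
        alpha = Fraction(G - 1, C - 1)
        parity_fraction = Fraction(1, G)
        points.append((G, alpha, parity_fraction))
    return points


if __name__ == "__main__":
    for G, alpha, parity in design_points(21):
        print(f"G={G:2d}  alpha={float(alpha):.2f}  parity={float(parity):.1%}")
```

At G = 2 half of the raw capacity holds redundancy, matching mirroring, while at G = C the parity share falls to 1/C and α reaches 1.0, matching RAID Level 5.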

A few other studies have looked at improving failure recovery performance via techniques
similar to parity declustering. Teradata [Teradata85] defined and implemented interleaved
declustering for mirrored disks, which was subsequently evaluated by Copeland and Keller
[Copeland89]. Muntz and Lui [Muntz90] first proposed applying declustering to parity-based arrays,
but left open the problem of implementation, specifically appropriate data layouts. Ng and Mattson
[Ng92] developed a data layout solution concurrently with the research reported in this paper,
using essentially the same technique as is described in Section 4. Our paper provides a more
thorough treatment of many implementation issues, but does not address one interesting issue
mentioned by Ng and Mattson: the interaction of parity declustering with distributed sparing
[Menon92b]. We believe this topic merits further examination. Reddy and Bannerjee [Reddy91]
also proposed a technique for implementing a form of parity declustering where the declustering
ratio is fixed at approximately 0.5. Merchant and Yu [Merchant92] described a substantially
different but equivalent-performance implementation of parity declustering, which we discuss in
detail in Section 4.3.

4. Disk array data layout for parity declustering

In most disk array systems, the array controller (whether implemented in hardware or as a
device driver in the host operating system) implements an abstraction of the array as a linear
address space. A disk-managing application such as a file system views the disk array's data units
as a linear sequence of disk sectors that can be read or written. Parity units typically do not
appear in this address space; that is, they are not addressable by the application program. The
array controller translates addresses in this user space into physical disk locations (disk
identifiers and disk offsets) as it performs requested accesses. It is also responsible for
performing the redundancy-maintaining accesses implied by application write accesses. This mapping
of an application's logical unit of stored data to physical disk locations and associated parity
locations is referred to as the disk array's layout. In this section we discuss goals for a disk
array layout, present a layout for declustered parity based on balanced incomplete block designs,
and contrast it to a layout proposed by Merchant and Yu [Merchant92] which supports more
configurations of large arrays at the cost of higher complexity.

4.1. Layout goodness criteria

Extending from non-declustered disk array layout research [Lee90, Dibble90], we have
identified six criteria for a good disk array layout.

1. Single failure correcting. No two stripe units in the same parity stripe may reside on the same
physical disk. This is the basic characteristic of any single-failure-tolerating redundancy
organization. In arrays in which groups of disks have a common failure mode, such as power or
data cabling, this criterion should be extended to prohibit the allocation of units from one parity
stripe to two or more disks sharing that common failure mode [Schulze89, Gibson93].

2. Distributed recovery workload. When any disk fails, its user workload should be evenly
distributed across all other disks in the array. When replaced or repaired, its reconstruction
workload should also be evenly distributed.

3. Distributed parity. Parity information should be evenly distributed across the array to balance
parity update load.

4. Efficient mapping. The functions mapping a file system's logical block address to physical
disk addresses for the corresponding data unit and parity stripe, and the appropriate inverse
mappings, must be efficiently implementable; they should consume neither excessive
computation nor memory resources.

5. Large write optimization. The layout should ensure that when a user performs a write that is
the size of the data portion of a parity stripe and starts on a parity stripe boundary, it is
possible to execute the write without pre-reading the prior contents of any disk data. Since the
new parity unit depends only on the new data, this criterion requires that it be possible to simply
compute the new parity in memory and write it to the appropriate disk location. Another way of
stating this criterion is that the allocation of contiguous user data to disk data units should
correspond to the allocation of disk data units to parity stripes.

6. Maximal parallelism. A read of contiguous user data with size equal to a data unit times the
number of disks in the array should induce a single data unit read on all disks in the array
(while requiring alignment only to a data unit boundary). This ensures that maximum parallelism,
and therefore minimum response time, can be obtained.
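
Two of the criteria above lend themselves to a mechanical check. The sketch below tests criterion 1 (single failure correcting) and criterion 3 (distributed parity) on a layout given as a list of parity stripes; the representation (each stripe as a pair of data-disk list and parity disk) and the function names are assumptions made for illustration.

```python
from collections import Counter


def single_failure_correcting(stripes):
    """Criterion 1: no parity stripe places two of its units on the same disk."""
    return all(len(set(data) | {parity}) == len(data) + 1 for data, parity in stripes)


def parity_evenly_distributed(stripes, num_disks):
    """Criterion 3: every disk holds the same number of parity units."""
    counts = Counter(parity for _, parity in stripes)
    return len({counts[disk] for disk in range(num_disks)}) == 1


def rotated_raid5(num_disks):
    """One full rotation of a RAID Level 5 layout with parity moving one disk per stripe."""
    stripes = []
    for stripe in range(num_disks):
        parity = (num_disks - 1 - stripe) % num_disks
        data = [disk for disk in range(num_disks) if disk != parity]
        stripes.append((data, parity))
    return stripes


if __name__ == "__main__":
    layout = rotated_raid5(5)
    print(single_failure_correcting(layout), parity_evenly_distributed(layout, 5))
```

A layout that keeps all parity on one dedicated disk still passes the first check but fails the second, which is the contention problem that rotating parity is meant to avoid.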

Criterion six should not be interpreted as placing constraints on the size of the data unit in the
array: it makes recommendations only about the assignment of consecutive data units to disks.
Using more than one disk to service a read operation increases the positioning overhead
(cumulative seek time and rotational delay) incurred by the read, but reduces the data transfer
time. If the amount of data transferred from each drive is relatively small, and other requests are
waiting to access the array, then the parallel transfer of the access will lead to significantly
lower throughput because of this extra positioning overhead. In this case, higher throughput would
be achieved by servicing multiple accesses concurrently, with each access using fewer drives.
However, if a very large read is serviced by a small number of disks, the response time of the read
will be very long due to the lack of parallel data transfer. Therefore, the stripe unit size should
be selected according to the characteristics of the expected workload [Chen90b], and the layout
policy should not influence this selection.

The best way to understand the value of criterion six is to consider the ramifications of
disregarding it. After the characteristics of the expected workload have been used to determine the
appropriate data unit size, it may still be the case that there occur some user accesses large
enough to span all the disks in the array. If criterion six is ignored, the data units of a very
large contiguous read could be allocated over a possibly small subset of the disks. (This is
consistent with criterion five if G is much smaller than C.) This could render the file system or
application program unable to achieve high transfer bandwidth even for very large contiguous reads,
and so the response time of these reads would be many times longer than necessary. Criterion six
provides a very simple model for file systems and applications to ensure fast transfer for large
objects.

Finally, note that the first four criteria deal exclusively with relationships between stripe units
and parity stripe membership, while the last two make recommendations for the relationship
between user data allocation and parity stripe organization. A file system is, of course, not
required to allocate contiguous user data contiguously in the array's address space. In this sense
the array controller has no direct control over whether or not the last two criteria are always
met, even if it is implemented as a device driver in the host. The best that can be done is to meet
these last two criteria for data units that are contiguous in the address space of the array.

4.2. Layouts based on balanced incomplete block designs

The primary goal in designing a layout strategy for parity declustering is to meet the second
goodness criterion: every surviving disk in the array should absorb an equivalent fraction of the
total extra workload induced by a failure, including both accesses invoked by users and
reconstruction accesses. An equivalent formulation is that the same number of units be read from
each surviving disk during the reconstruction of a failed disk. This will be achieved if the total
number of parity stripes that include a given pair of disks is constant across all pairs of disks,
that is, if disks number i and j appear together in a parity stripe exactly n times for any i and j,
where n is some fixed constant. As suggested by Muntz and Lui, a layout with this property can be
derived from a balanced incomplete block design [Hall86]. This section shows how such a layout may
be implemented.

A block design is an arrangement of v distinct objects into b tuples⁴, each containing k
elements, such that each object appears in exactly r tuples, and each pair of objects appears in
exactly $\lambda_p$ tuples. For example, using non-negative integers as objects, a block design
with $b = 5$, $v = 5$, $k = 4$, $r = 4$, and $\lambda_p = 3$ is given in Table 1.

This example demonstrates a simple form of block design, called a complete block design,
which includes all combinations of exactly k distinct elements selected from the set of v objects.
The number of these combinations is $\binom{v}{k}$. Note that only three of v, k, b, r, and
$\lambda_p$ are free variables since the following two relations are always true: $bk = vr$, and
$r(k-1) = \lambda_p(v-1)$. The first of

4. These tuples are called blocks in the block design literature. We avoid this name as it conflicts
with the commonly held definition of a block as a contiguous chunk of data. Similarly we use
$\lambda_p$ instead of the usual $\lambda$ for the number of tuples containing each pair of objects
to avoid conflict with the common usage of $\lambda$ as the rate of arrival of user accesses at the array.

Tuple Number    Tuple
0               0, 1, 2, 3
1               0, 1, 2, 4
2               0, 1, 3, 4
3               0, 2, 3, 4
4               1, 2, 3, 4

Table 1: A sample block design on five objects with four objects per tuple.


Offset    DISK0    DISK1    DISK2    DISK3    DISK4
0         D0.0     D0.1     D0.2     P0       P1
1         D1.0     D1.1     D1.2     D2.2     P2
2         D2.0     D2.1     D3.1     D3.2     P3
3         D3.0     D4.0     D4.1     D4.2     P4

Figure 4: Example data layout in a declustered parity organization.

these  relations  counts  the  objects  in  the  block  design  in  two  ways,  and  the  second  counts  the  pairs 
in  two  ways. 
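The definition and the two counting relations can be checked mechanically. The following is a minimal sketch, not part of the original paper: the function names `complete_block_design` and `design_parameters` are hypothetical, and the code assumes objects are the integers 0 through v-1. It generates the complete design on five objects with four objects per tuple, then counts how often each object and each pair of objects appears.

```python
from itertools import combinations
from math import comb

def complete_block_design(v, k):
    """All k-element tuples drawn from objects 0..v-1, in lexicographic order."""
    return list(combinations(range(v), k))

def design_parameters(design, v):
    """Return (b, k, r, lambda_p), checking that the design is balanced."""
    b, k = len(design), len(design[0])
    # Number of tuples containing each object, and each pair of objects.
    per_object = {sum(obj in t for t in design) for obj in range(v)}
    per_pair = {sum(x in t and y in t for t in design)
                for x, y in combinations(range(v), 2)}
    if len(per_object) != 1 or len(per_pair) != 1:
        raise ValueError("objects or pairs do not appear equally often")
    return b, k, per_object.pop(), per_pair.pop()

v = 5
design = complete_block_design(v, 4)
b, k, r, lam = design_parameters(design, v)
print(design)
print(b, k, r, lam, b * k == v * r, r * (k - 1) == lam * (v - 1), b == comb(v, k))
```

For v = 5 and k = 4 the generated tuples coincide with Table 1, and both relations hold with the parameter values quoted in the text.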

The layout associates disks with objects and parity stripes with tuples. For clarity, the
following discussion is illustrated by the construction of the layout in Figure 4 from the block design in
Table 1. To build a layout, we find a block design with v = C, k = G, and the minimum possible
value for b. The mapping identifies the elements of a tuple in a block design with the disk
numbers on which each successive stripe unit of a parity stripe is allocated. In Figure 4, the first tuple
in the design of Table 1 is used to lay out parity stripe 0; the three data blocks in parity stripe 0 are
on disks 0, 1, and 2, and the parity block is on disk 3. Based on the second tuple, stripe 1 is on
disks 0, 1, and 2, with parity on disk 4. In general, stripe unit j of parity stripe i is assigned to the
lowest available offset on the disk identified by the j-th element of tuple i mod b in the block
design.
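The placement rule above can be sketched in a few lines. This is an illustrative sketch rather than the authors' implementation: the name `build_layout` is hypothetical, each disk is modelled as a Python list whose index is the offset, and parity is assumed to occupy the last element of every tuple, as in Figure 4.

```python
from typing import List, Sequence

def build_layout(design: Sequence[Sequence[int]], num_disks: int) -> List[List[str]]:
    """Place stripe unit j of parity stripe i at the lowest free offset of the
    disk named by element j of tuple i. The last element holds parity."""
    disks: List[List[str]] = [[] for _ in range(num_disks)]
    for i, tup in enumerate(design):
        for j, disk in enumerate(tup):
            is_parity = (j == len(tup) - 1)
            label = f"P{i}" if is_parity else f"D{i}.{j}"
            disks[disk].append(label)  # appending fills the lowest available offset
    return disks

TABLE_1 = [(0, 1, 2, 3), (0, 1, 2, 4), (0, 1, 3, 4), (0, 2, 3, 4), (1, 2, 3, 4)]
layout = build_layout(TABLE_1, num_disks=5)
for offset in range(4):
    print(offset, [layout[d][offset] for d in range(5)])
```

Printing the rows gives one line per offset, with every parity unit except P0 landing on disk 4, which is the imbalance the next paragraph addresses.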

It is apparent from Figure 4 that this approach produces a layout that violates the distributed
parity criterion (3). To resolve this violation, we duplicate the above layout G times (four times
for the example in Figure 4), assigning parity to a different element of each tuple in each
duplication, as shown in Figure 5. This layout, the entire contents of Figure 5, is further duplicated until
all stripe units on each disk are mapped to parity stripes. We refer to one iteration of this layout
(the first four blocks on each disk in Figure 5) as a block design table, and one complete cycle (all
blocks in Figure 5) as a full block design table.

Figure 5: Full block design table for a parity declustering organization.


Of course, if the block design has a very large number of tuples, then the size of one full table
can exceed the size of the array. This results in violations of criteria two and three. Hence, it is
necessary to find an appropriately small design for each combination of C and G.


It is easy to verify that the layout of Figure 5 meets the first four of the criteria: (1) No two
stripe units from the same parity stripe will be assigned to the same disk because no tuple in the
block design contains the same element more than once. (2) The failure-induced workload is
evenly balanced because each disk appears together with each other disk in exactly $\lambda_p$ parity
stripes in one block design table. This property implies that when any disk fails, exactly $\lambda_p$ stripe
units must be read from each other disk in order to reconstruct the missing data for that table.
Since the failure-induced workload is balanced in each table, it is balanced over the entire array.
(3) Parity is balanced because over the course of one full table, parity is assigned to each element
of each tuple in the block design exactly once (refer to the boxes labelled "parity" in Figure 5).
Since each element appears exactly Gr times in the full table, and parity falls on one in every G of
those appearances, each disk is assigned a parity unit exactly r times over the course of the full
table. Again, since parity is balanced in every full table, it is balanced over the entire array.
(4) While it is not guaranteed that a block design will exist for every possible combination of C
and G, nor that the number of blocks will be sufficiently small that the size of a full table will not
exceed the size of the array, we have identified acceptable block designs for all combinations of C
and G up to 40 disks, and for many of the possible combinations beyond⁵. Section 9 discusses the
problem of designing larger arrays.
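The first three criteria can be checked directly on a full block design table built from Table 1. The sketch below is illustrative only: `full_table` is a hypothetical helper, and the order in which parity rotates through the tuple positions is an assumption, since any order that visits every position once gives the same counts.

```python
from collections import Counter
from itertools import combinations

TABLE_1 = [(0, 1, 2, 3), (0, 1, 2, 4), (0, 1, 3, 4), (0, 2, 3, 4), (1, 2, 3, 4)]

def full_table(design):
    """Return (disks_in_stripe, parity_disk) for every parity stripe in one full
    block design table: G copies of the design, parity rotated in each copy."""
    G = len(design[0])
    stripes = []
    for copy in range(G):
        parity_pos = G - 1 - copy  # rotation order is an assumption
        for tup in design:
            stripes.append((tup, tup[parity_pos]))
    return stripes

stripes = full_table(TABLE_1)

# Criterion 1: no disk appears twice within a parity stripe.
single_failure_ok = all(len(set(t)) == len(t) for t, _ in stripes)

# Criterion 2: every pair of disks shares the same number of parity stripes.
pair_counts = Counter(p for t, _ in stripes for p in combinations(sorted(t), 2))

# Criterion 3: every disk holds the same number of parity units.
parity_counts = Counter(parity for _, parity in stripes)

print(single_failure_ok, set(pair_counts.values()), dict(parity_counts))
```

With G = 4, r = 4 and $\lambda_p = 3$, each pair of disks shares $G\lambda_p = 12$ stripes in the full table and each disk holds r = 4 parity units.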

As previously mentioned, criteria five and six are dependent on the assignment of user data
units to units in the address space of the array, and so a data layout mechanism cannot guarantee
that they will be met. Assuming that this user data mapping is sequential, that is, that successive
blocks of user data are mapped to the successive data units of the array's address space, the above
layout meets criterion five (the large write optimization), but fails to meet criterion six (maximum
parallelism). To see this, note that since consecutive user data is always consecutive within a
parity stripe, a write of G-1 user data units aligned on a G-1 unit boundary in the address space of the
array will always map to the complete set of data units in some parity stripe, and so the large write
optimization can be applied. However, Figure 4 shows that reading C (5, in this case) successive
user data units starting at the unit marked D0.0 results in disks 0 and 1 being used twice, and disks
3 and 4 not at all, and hence criterion six is violated.

As illustrated in Figure 6, it is possible to meet criterion six by employing a user-data
mapping similar to Lee's left-symmetric layout for non-declustered arrays [Lee91], but this causes the
layout to violate criterion five. This mapping works by assigning each successive user data block
to the first available data unit on each successive disk, thereby guaranteeing that criterion six is
met. It causes criterion five to be violated because successive user data blocks may be assigned to
differing parity stripes.

Since typical OLTP transactions access data in small units [TPCA89], large accesses account
for a small fraction of the workload, typically arising from decision-support or array-maintenance

5. We are constructing a database of block designs derived from the sources described in Section
4.4. At the time of publication, this database is available via anonymous ftp from ftp.cs.cmu.edu
(internet address 128.2.206.173) in the file project/nectar-io/Declustering/BD_database.tar.Z

Figure 6: Meeting criterion six via left-symmetric parity-declustered layout.

The figure shows the parity stripes that are allocated by the first two iterations of the block
design table, with data units mapped in the style of Lee's left-symmetric layout [Lee91]. For
clarity, the data units are marked with their identifiers in the address space of the array, rather than
their parity stripe ID and parity stripe offset as in Figure 4 and Figure 5. Note that the data units
in parity group 7 are not sequential in the array's data address space, so criterion five is violated.


functions rather than application transactions. Thus, for OLTP environments, a minority of user
accesses touch more than one data unit, and reads that access a number of data units comparable
to C are rarer still [Ramakrishnan92]. Therefore the benefit of achieving criterion six in the layout
would be marginal in the OLTP workloads we are emphasizing. However, we have observed that
under user workloads where large reads are more common, the failure to meet criterion six,
combined with the fact that a declustered parity array must skip over more parity units when servicing
a read large enough to access multiple data units from multiple disks, causes the response time of
these large reads to be significantly longer in parity declustering than in RAID Level 5, for
example. We defer to future work the problem of simultaneously meeting both criterion five and
criterion six⁶.

4.3.  Layouts  based  on  random  permutations 

Merchant and Yu [Merchant92] have independently developed an array layout strategy for
declustered parity disk arrays. This section briefly describes their layout strategy and compares it
to the block-design based approach developed above.


6. We note that one promising approach to improving the response time of large reads would be to
optimize the ordering of tuples in the block design and elements in each tuple in order to maximize
the adherence to criterion six without giving up adherence to criterion five.

Their approach distributes failure-induced workload (criterion two) and parity (criterion
three) over the disks in the array by randomizing the assignment of data and parity units to disks.
The layout defines a linear address space consisting of units numbered 0 through BC-1, where B is
the number of units on a disk and C is the number of disks in the array. Every G-th unit in this
address space (units number G-1, 2G-1, 3G-1, etc.) contains parity for the previous G-1 units. If
the assignment of these units to disks were truly random, then there would be no guarantee that
the units comprising a parity stripe all reside on different disks (criterion one). Instead, their
layout uses a set of random permutations on the disk identifiers to assign units to disks.

Define a set of random permutations of the integers from 0 to C-1 as follows: $P_n$, the n-th
permutation in the set, maps the integer a to $P_{n,a}$, where $0 \le a < C$ and $0 \le P_{n,a} < C$, as illustrated:

$P_n: (0, 1, \ldots, C-1) \mapsto (P_{n,0}, P_{n,1}, \ldots, P_{n,C-1})$

To map the location of unit i, let $n = \lfloor i/C \rfloor$ and $j = i \bmod C$. The physical location of
unit i is offset n into the disk with identifier $P_{n,j}$. Thus the permutation $P_n$ is used to identify the
disks on which units number nC through (n+1)C-1 reside.
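The address computation just described can be sketched as follows. This is a hedged illustration, not Merchant and Yu's algorithm: the functions `permutation` and `locate`, the `seed` parameter, and the use of Python's `random.Random.shuffle` to produce each $P_n$ are all assumptions made for the example. The example uses C = 8 and G = 4, so that C is a multiple of G.

```python
import random

def permutation(n: int, C: int, seed: int = 0) -> list:
    """Return P_n, a pseudo-random permutation of 0..C-1 that depends only on
    (seed, n). random.shuffle stands in for the generator used in [Merchant92]."""
    rng = random.Random(seed * 1_000_003 + n)
    perm = list(range(C))
    rng.shuffle(perm)
    return perm

def locate(i: int, C: int, seed: int = 0):
    """Map unit i of the linear address space to (disk, offset)."""
    n, j = divmod(i, C)  # n = floor(i / C), j = i mod C
    return permutation(n, C, seed)[j], n

C, G = 8, 4
for stripe in range(6):
    units = range(stripe * G, (stripe + 1) * G)  # the last unit of each stripe is parity
    print(stripe, [locate(i, C) for i in units])
```

Because each row of C units is placed by a single permutation, every disk receives exactly one unit per offset, and the G units of a stripe always land on distinct disks.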

When C is a multiple of G, no parity stripe will span more than one permutation. Since the
elements of each permutation are distinct, the units comprising a parity stripe will all reside on
different disks, and so criterion one is met. If C is not a multiple of G, then using each permutation
R = LCM(C,G)/C times sequentially, where LCM() is the least-common-multiple function,
ensures that no parity stripe spans two different permutations (each run of RC = LCM(C,G) units
then holds a whole number of parity stripes), again meeting the needs of criterion one. The fact
that the set of permutations used to map an array is selected randomly implies both
that parity blocks are randomly distributed, and that each parity stripe is mapped to a set of disks
chosen randomly from the $\binom{C}{G}$ possible combinations, ensuring that criteria two and three are also
met. Criterion four is met as long as the permutation $P_n$ can be computed efficiently. Merchant
and Yu present an algorithm for this that operates by controlling the exchange phase of a series of
applications of a shuffle-exchange network with random bits derived from a linear-congruential
random number generator. While certainly requiring substantial computation, this algorithm's
asymptotic computation needs grow slowly with respect to C and G. As in the block-design based
layout of Figure 5, criterion five is met and six is violated by this permutation-based layout.
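The case where C is not a multiple of G can be illustrated with a small sketch. It rests on an interpretation that the text does not spell out: each permutation is assumed to serve R consecutive rows of C units, so the permutation index for unit i is taken as $\lfloor i/(RC) \rfloor$ while the offset remains $\lfloor i/C \rfloor$. The names `repeats` and `locate` are hypothetical, and `random.Random.sample` again stands in for the permutation generator.

```python
import random
from math import gcd

def repeats(C, G):
    """R = LCM(C, G) / C, the number of consecutive rows sharing one permutation."""
    return (C * G // gcd(C, G)) // C

def locate(i, C, G, perms):
    """(disk, offset) of unit i when each permutation serves R consecutive rows."""
    R = repeats(C, G)
    return perms[i // (R * C)][i % C], i // C

C, G = 5, 4
rng = random.Random(1)
perms = [rng.sample(range(C), C) for _ in range(3)]  # three random permutations
R = repeats(C, G)
num_stripes = 3 * R * C // G
stripes = [[locate(i, C, G, perms)[0] for i in range(s * G, (s + 1) * G)]
           for s in range(num_stripes)]
print(R, all(len(set(s)) == G for s in stripes))
```

For C = 5 and G = 4 this gives R = 4, so each permutation covers 20 units, which is exactly five parity stripes.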

We have verified by simulation that this layout yields array performance essentially identical
to that of the block-design based layout. The advantage of this algorithm, then, is that it is able to
generate a layout for arbitrary C and G, whereas the block design approach is limited to those
combinations of C and G for which a design can be found. The disadvantage is the relatively large
amount of computation a host or controller must do to compute a physical disk address every time
a unit of data is accessed. By way of contrast, the block-design based algorithm computes
physical disk addresses by a table lookup and a few simple arithmetic operations.

4.4.  Choosing  between  layouts 

Complete block designs such as the one in Table 1 are easily generated, but in most cases
they are too large to be useful. The number of blocks in a complete design, $\binom{v}{k}$, is in general so
large that the block-design-based layout fails to have an efficient mapping. For example, a 40 disk
array with 10% parity overhead (G=10) mapped by a complete block design will have about one
billion tuples in its block design table. In addition to the ridiculous amount of memory required to
store this table, the layout generated from it will meet neither the distributed parity nor distributed
reconstruction criteria because even large disks rarely have more than a few million sectors.
Fortunately, there exists an extensive literature on the theory of balanced incomplete block designs
(BIBDs), which are simply designs having fewer than $\binom{v}{k}$ tuples.
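The size of the complete design in this example follows from a short calculation. The sketch below is only arithmetic on the quantities defined in Section 4.2; the variable names are illustrative. It uses the relation $bk = vr$ to obtain r, and the fact that one full block design table occupies Gr units on each disk.

```python
from math import comb

C, G = 40, 10                  # v = C disks, k = G units per parity stripe
b = comb(C, G)                 # tuples in the complete block design
r = b * G // C                 # tuples containing any one disk, from bk = vr
units_per_disk = G * r         # units each disk contributes to one full table

print(f"tuples in block design table: {b:,}")
print(f"units per disk in one full table: {units_per_disk:,}")
```

The tuple count is roughly 0.85 billion, and a full table would need about two billion units on every disk, far more than the few million sectors the text mentions.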

The construction of BIBDs is an active area of research in combinatorial theory, and there
exists no technique that allows the direct construction of a design with an arbitrarily-specified set
of parameters. Instead, designs are generated on a case-by-case basis, and tables of known designs
[Hanani75, Hall86, Chee89, Mathon90] are published and periodically updated. These tables are
dense when v is small (less than about 45), but become gradually sparser as v increases. Hanani
[Hanani75], for example, gives a table of designs that can be used to generate a layout for any
value of G given C not larger than 43, and for many combinations with larger C.

Since the block design approach is computationally more efficient than the random-permutation
approach, we recommend that it be used if the array can be configured using values of C and
G for which an acceptably small block design is known. When a system's goals cannot be met
using any such configuration, then, of course, use the random-permutation algorithm. Section 9
discusses the problem of configuring very large arrays.

Figure 7: The structure of raidSim.

5.  Evaluation  methodology 

All analyses in this paper were done using an event-driven disk array simulator called
raidSim [Chen90b, Lee91], originally developed for the RAID project at U.C. Berkeley [Katz89]. It
consists of four primary components, illustrated in Figure 7. The top level of abstraction contains
a synthetic reference generator. Table 2a shows the workload generated for the simulations. This
workload is based on access statistics measured on an airline-reservation OLTP system
[Ramakrishnan92]. The requests produced by this workload generator are sent to a RAID striping
driver, whose function is to translate each user request into the corresponding set of disk accesses.
Table 2b shows the configuration of our extended version of this striping driver. Low-level disk
operations generated by the striping driver are sent to a disk simulation module, which accurately
models significant aspects of each specific disk access (seek time, rotation time, cylinder layout,
etc.). Table 2c shows the characteristics of the 314 MB, 3-1/2 inch diameter IBM 0661 Model 370
(Lightning) disks on which the simulations are based [IBM0661]. At the lowest level of
abstraction in raidSim is an event-driven simulator, which is invoked to cause simulated time to pass.

As disks get smaller and less expensive, and as systems demand increased I/O rates, the
number of disks in a typical array will increase. For this reason, we focus our simulations on array
sizes that are larger than are common today. Specifically, the simulations reported in subsequent
sections use a default array size of 40 disks. In order to verify that our conclusions are not specific
to a particular array size, we also ran 20-disk simulations in most cases. The performance of the
20-disk array was identical to that of the 40-disk array for a given user workload measured in
accesses per second per disk, and so we report only the 40-disk results here.

All reported simulation results represent averages over five independently seeded simulation
runs. In all cases, this resulted in very small confidence intervals (a few percent of the mean) and
so the performance plots in subsequent sections do not report these actual intervals. For
simulations of fault-free and degraded-mode arrays (refer to Section 7), the simulation was not
terminated until the 95% confidence interval on the user response time had fallen to less than 3% of the
mean. For reconstruction-mode runs, the simulation was terminated at the completion of
reconstruction. All simulations were "warmed up" by running a few accesses before initiating the
collection of statistics for that run.

Table 2a: Workload Parameters

Access type   % of workload   Operation   Size (KB)   Alignment (KB)   Distribution
1             [illegible]     Read        4           4                Uniform
2             [illegible]     Write       4           4                Uniform
3             2%              Read        24          24               Uniform
4             2%              Write       24          24               Uniform

Number of requesting processes: 3 x (number of disks)
Think time distribution: Exponential, with mean varied to adjust offered load

Table 2b: Array Parameters

Array size:            40 disks
Stripe unit size:      24KB
Reconstruction unit:   24KB
Head scheduling:       FIFO
User data layout:      Sequential user data -> sequential units of sequential parity stripes
Data/Parity layout:    Block-design based
Disk spindles:         Synchronized

Table 2c: Disk Parameters

Geometry:           940 cylinders, 14 heads, 48 sectors/track
Sector size:        512 bytes
Revolution time:    13.9 ms
Seek time model:    $2.0 + 0.01 \cdot cyls + 0.46 \cdot \sqrt{cyls}$ (ms; cyls = seek distance in cylinders - 1)
                    2.0 ms min, 12.5 ms average, 25 ms max
Track skew:         4 sectors
Cylinder skew:      17 sectors
MTTF:               150,000 hours

6.  Algorithms  for  lost  data  reconstruction 

A reconstruction algorithm is a strategy used by a background reconstruction process to
regenerate data resident on the failed disk and store it on a replacement. In this section we
evaluate two such algorithms, and then report on a study investigating the effects of modifying the size
of the reconstruction unit, which is the amount of data read or written in each reconstruction
access.


6.1.  Comparing  reconstruction  algorithms 


The most straightforward approach, which we term the stripe-oriented algorithm, is as
follows:

for each unit on the failed disk
    1. Identify the parity stripe to which the unit belongs.
    2. Issue low-priority read requests for all other units in the stripe, including the parity unit.
    3. Wait until all reads have completed.
    4. Compute the exclusive-or over all units read.
    5. Issue a low-priority write request to the replacement disk.
    6. Wait for the write to complete.
end
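The loop above can be sketched on an in-memory model of the array. This is an illustration under stated assumptions rather than the simulator's code: disks are Python lists of byte strings laid out from the block design of Table 1, the helpers `xor_units`, `build_array` and `reconstruct` are hypothetical, and request priorities and waiting are not modelled, so steps 2 and 3 collapse into a list read and steps 5 and 6 into an append.

```python
import os
from functools import reduce

TABLE_1 = [(0, 1, 2, 3), (0, 1, 2, 4), (0, 1, 3, 4), (0, 2, 3, 4), (1, 2, 3, 4)]
UNIT = 16  # bytes per stripe unit, kept tiny for illustration

def xor_units(units):
    """Bytewise exclusive-or of equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*units))

def build_array(design, num_disks):
    """Fill disks with random data plus parity. Returns (disks, stripe_of), where
    stripe_of[(disk, offset)] lists every (disk, offset) in the same stripe."""
    disks = [[] for _ in range(num_disks)]
    stripe_of = {}
    for tup in design:
        data = [os.urandom(UNIT) for _ in tup[:-1]]
        units = data + [xor_units(data)]        # last tuple element holds parity
        members = []
        for disk, unit in zip(tup, units):
            members.append((disk, len(disks[disk])))
            disks[disk].append(unit)
        for member in members:
            stripe_of[member] = members
    return disks, stripe_of

def reconstruct(disks, stripe_of, failed):
    """Stripe-oriented reconstruction; only the failed disk's length is consulted."""
    replacement = []
    for offset in range(len(disks[failed])):                      # each lost unit
        members = stripe_of[(failed, offset)]                     # step 1
        reads = [disks[d][o] for d, o in members if d != failed]  # steps 2-3
        replacement.append(xor_units(reads))                      # steps 4-6
    return replacement

disks, stripe_of = build_array(TABLE_1, 5)
rebuilt = reconstruct(disks, stripe_of, failed=1)
print(rebuilt == disks[1])
```

Because the lost unit is the exclusive-or of the other units in its stripe whether it held data or parity, the same loop recovers either kind of unit.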

This algorithm uses low-priority requests in order to minimize the impact of reconstruction
on user response time, since commodity disk drives do not generally support any form of
preemptive access. A low-priority request is used even for the write to the replacement disk, since this
disk services writes in the user request stream as well as reconstruction writes [Holland92].

The problem with this algorithm is that it is unable to consistently utilize all disk bandwidth
not absorbed by user accesses. First, it does not overlap reads of surviving disks with writes to the
replacement, so the surviving disks are idle with respect to reconstruction during the write to the
replacement, and vice versa. Second, the algorithm simultaneously issues all the reconstruction
reads associated with a particular parity stripe, and then waits for all to complete. Some of these
read requests will take longer to complete than others, since the depth of the disk queues and disk
head locations will not be identical for all disks. Therefore, during the read phase of the
reconstruction loop, each involved disk may be idle from the time that it completes its own
reconstruction read until the time that the slowest read completes. Third, in the declustered parity
architecture, not every disk is involved in the reconstruction of every parity stripe, and so some
disks remain idle during every iteration of the algorithm.

These deficiencies can be partially overcome by parallelizing this algorithm, that is, by
simultaneously reconstructing a set of P parity stripes instead of just one [Holland92], but this does not
guarantee that the reconstruction process will absorb all the available disk bandwidth. Disks may
still idle with respect to reconstruction because the set of P parity stripes under reconstruction at
any point in time is not guaranteed to use all the disks in the array. Furthermore, the number of
outstanding disk requests each independent reconstruction process maintains varies as accesses
are issued and complete, and so the number of such processes must be large if the array is to be
consistently utilized. Finally, a large number of reconstruction processes require a large amount of
buffer memory in the host or controller.

A better approach is to restructure the reconstruction algorithm as a disk-oriented, instead of
stripe-oriented, process [Merchant92, Hou93, Holland93]. Instead of creating one reconstruction
process, the host or array controller creates C processes, each associated with one disk. Each of
the C-1 processes associated with a surviving disk executes the following loop:


repeat
    1. Find the lowest-numbered unit on this disk that is needed for reconstruction.
    2. Issue a low-priority request to read the indicated unit into a buffer.
    3. Wait for the read to complete.
    4. Submit the unit's data to a centralized buffer manager for subsequent XOR.
until (all necessary units have been read)

The process associated with the replacement disk executes:

repeat
    1. Request a buffer of fully reconstructed data from the buffer manager, blocking if none
       is available.
    2. Issue a low-priority write of the buffer to the replacement disk.
    3. Wait for the write to complete.
until (the failed disk has been reconstructed)

In this way the buffer manager provides a central repository for data from parity stripes that
are currently "under reconstruction." When a new buffer arrives from a surviving-disk process,
the buffer manager XORs the data into an accumulating "sum" for that parity stripe, and notes the
arrival of a unit for the indicated parity stripe from the indicated disk. When it receives a request
from the replacement-disk process it searches its data structures for a parity stripe for which all
units have arrived, deletes the corresponding buffer from its active list, and returns this buffer to
the replacement-disk process.⁷
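The cooperation between the per-disk processes and the buffer manager can be sketched with threads. The following is a simplified illustration, not the paper's implementation: `BufferManager`, `build_array` and `reconstruct_disk_oriented` are hypothetical names, Python threads stand in for the C processes, disks are in-memory lists built from Table 1, and disk queues, request priorities and buffer limits are not modelled.

```python
import os
import queue
import threading
from functools import reduce

TABLE_1 = [(0, 1, 2, 3), (0, 1, 2, 4), (0, 1, 3, 4), (0, 2, 3, 4), (1, 2, 3, 4)]
UNIT = 16  # bytes per unit, kept tiny for illustration

def xor_units(units):
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*units))

def build_array(design, num_disks):
    """Each disk is a list of (stripe id, unit bytes); the last tuple element is parity."""
    disks = [[] for _ in range(num_disks)]
    for stripe, tup in enumerate(design):
        data = [os.urandom(UNIT) for _ in tup[:-1]]
        for disk, unit in zip(tup, data + [xor_units(data)]):
            disks[disk].append((stripe, unit))
    return disks

class BufferManager:
    """Accumulates an XOR "sum" per parity stripe and releases completed ones."""
    def __init__(self, stripes, units_expected):
        self.remaining = {s: units_expected for s in stripes}
        self.sums = {}
        self.ready = queue.Queue()
        self.lock = threading.Lock()

    def submit(self, stripe, unit):
        with self.lock:
            old = self.sums.get(stripe)
            self.sums[stripe] = unit if old is None else xor_units([old, unit])
            self.remaining[stripe] -= 1
            if self.remaining[stripe] == 0:          # all surviving units arrived
                self.ready.put((stripe, self.sums.pop(stripe)))

    def next_reconstructed(self):
        return self.ready.get()                      # blocks if none is ready

def reconstruct_disk_oriented(disks, failed, stripe_width):
    lost = {stripe: off for off, (stripe, _) in enumerate(disks[failed])}
    manager = BufferManager(lost, stripe_width - 1)
    replacement = [None] * len(lost)

    def survivor(disk):                              # one per surviving disk
        for stripe, unit in disks[disk]:             # lowest-numbered unit first
            if stripe in lost:
                manager.submit(stripe, unit)

    def replacer():                                  # replacement-disk process
        for _ in lost:
            stripe, unit = manager.next_reconstructed()
            replacement[lost[stripe]] = (stripe, unit)

    threads = [threading.Thread(target=survivor, args=(d,))
               for d in range(len(disks)) if d != failed]
    threads.append(threading.Thread(target=replacer))
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return replacement

disks = build_array(TABLE_1, 5)
print(reconstruct_disk_oriented(disks, failed=1, stripe_width=4) == disks[1])
```

In this sketch each surviving-disk thread proceeds independently through its own disk, which is the property that lets the real algorithm keep every disk busy; stripes complete in whatever order their last unit happens to arrive.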

The advantage of the disk-oriented approach is that it is able to maintain one low-priority


7. When a disk is momentarily idled due to random fluctuations in the user workload, it is possible
for a reconstruction process to "race ahead" of the others and consume a large number of buffers.
This could potentially lead to increased buffer stalls because other processes would be unable to
acquire buffers when needed. We have not observed this to be a problem in our simulations, but it
could be addressed by slowing or stopping any reconstruction process that gets too far ahead of the
others.

Figure 8: Comparing reconstruction algorithms: (a) reconstruction time and (b) average user
response time during reconstruction.


request in each disk's queue at all times, which means that it will absorb all of the array's
bandwidth not absorbed by user accesses. This is demonstrated in the simulation results of Figure 8,
which plots the reconstruction time and average user response time versus the declustering ratio
($\alpha$) for 1-way, 8-way, and 16-way parallel stripe-oriented reconstruction, and for disk-oriented
reconstruction, in a 40-disk array using the parameters in Table 2. This figure shows that the
disk-oriented algorithm makes more efficient use of the system resources; reconstruction time is
reduced by up to 40% over the 16-way parallel stripe-oriented version, while the average and 90th
percentile response times remain essentially the same, independent of the value of $\alpha$.
Low-parallelism versions of the stripe-oriented algorithm yield slightly better user response time because
they cause disks to idle fairly frequently, allowing user requests to more often arrive to find an
empty disk queue. This does not happen in the disk-oriented algorithm because reconstruction
accesses are always initiated as soon as any disk becomes idle.

A P-way parallel stripe-oriented algorithm requires PG controller memory buffers, while a
disk-oriented algorithm requires about 2C or 3C. Thus except at very low declustering ratios, the
disk-oriented algorithm uses less buffer memory than the stripe-oriented algorithm with
significant parallelism, and yet delivers faster reconstruction. In the example 40-disk array with $\alpha$=0.5,
the disk-oriented algorithm requires about 100 buffers, while the 8-way parallel stripe-oriented
algorithm requires 160. Figure 8 shows that the disk-oriented algorithm is able to reconstruct
about twice as fast under these conditions.
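The buffer counts quoted above follow from the two expressions PG and 2C to 3C. The short sketch below reproduces them; the function names are illustrative, and the choice G = 20 is an assumption made here because it gives a declustering ratio close to 0.5 under the definition $\alpha = (G-1)/(C-1)$ used earlier in the paper.

```python
def stripe_oriented_buffers(P: int, G: int) -> int:
    """A P-way parallel stripe-oriented algorithm holds G buffers per stripe."""
    return P * G

def disk_oriented_buffers(C: int) -> tuple:
    """The text quotes 'about 2C or 3C' buffers for the disk-oriented algorithm."""
    return 2 * C, 3 * C

C, G = 40, 20                      # G = 20 is assumed; it gives alpha close to 0.5
alpha = (G - 1) / (C - 1)
print(f"alpha = {alpha:.2f}")
print("8-way stripe-oriented:", stripe_oriented_buffers(8, G))
print("disk-oriented range:", disk_oriented_buffers(C))
```

The stripe-oriented figure comes out at 160, and the disk-oriented range of 80 to 120 brackets the "about 100" cited in the text.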

Furthermore, because the total buffer requirements of the disk-oriented algorithm are
relatively small, the required memory can typically be borrowed from the controller or host buffer
cache. If a reconstruction buffer is the size of one track (as indicated by the results of the next
section) and a disk contains 10,000 tracks, then the 100 buffers required for the example 40-disk
array total about 1% of the size of one disk. If buffer memory costs 25 times as much per
megabyte as disk, a buffer cache of 10% of the size of one disk costs about 6% of the total disk
cost in the example array, and so is affordable in either the host or controller. The 1% needed to
effect reconstruction rapidly can thus be borrowed to greatly speed reconstruction, in most cases
without dramatically altering the performance of the cache.
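
As a check on the cost argument, the following sketch reproduces the arithmetic under the
paragraph's stated assumptions (track-sized buffers, 10,000 tracks per disk, memory at 25 times
the per-megabyte cost of disk). The function names are hypothetical.

```python
def buffer_fraction_of_disk(num_buffers: int, tracks_per_disk: int) -> float:
    """Size of the buffer pool as a fraction of one disk, with one track per buffer."""
    return num_buffers / tracks_per_disk


def cache_cost_fraction(cache_fraction_of_disk: float,
                        memory_cost_ratio: float,
                        num_disks: int) -> float:
    """Cost of the buffer cache relative to the total cost of all disks."""
    return cache_fraction_of_disk * memory_cost_ratio / num_disks


# 100 track-sized buffers on a disk with 10,000 tracks
print(f"buffers: {buffer_fraction_of_disk(100, 10_000):.1%} of one disk")
# a cache of 10% of one disk, memory 25x the cost of disk, 40 disks
print(f"cache cost: {cache_cost_fraction(0.10, 25, 40):.2%} of total disk cost")
```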

Because of its superior reconstruction time characteristics, the disk-oriented algorithm is
used for all the following performance analyses.

6.2.  Unit  of  reconstruction  selection 

In the algorithms presented so far, the reconstruction processes read or write one unit per
reconstruction access. Since the rate at which a disk drive is able to read or write data increases
with the size of an access, it is worthwhile to investigate the benefits of using reconstruction
accesses that are different in size from one data unit, that is, to decouple the size of the
reconstruction unit from that of the data unit. The block-design based layout described above
requires a simple modification to support this decoupling. This section describes this
modification and then investigates the sensitivity of failure-mode performance to the size of the
reconstruction unit.

Referring back to Figure 4, assume that the reconstruction unit is four times as large as the
data unit, and that disk number 1 has failed. If the reconstruction process at some point reads four
consecutive units starting at offset zero on disk 2, the data that is read contains data unit D3.1,
which is not needed to reconstruct disk 1. In general, since the units necessary to reconstruct a
particular drive are interspersed on the disks with units that are not, the reconstruction process
must either waste time and resources reading unnecessary data, or it must break up its accesses
into sizes smaller than one reconstruction unit, which results in substantially less efficient data
transfer from the disks.


[Figure 9 layout table. Columns: Offset, DISK0, DISK1, DISK2, DISK3, DISK4]

Figure 9: Doubling the size of the reconstruction unit.


This problem can be eliminated by repeating the tuple assignment pattern enough times to
pack multiple data stripe units into a single reconstruction unit. This modified layout is illustrated
in Figure 9, where the reconstruction unit size is twice the data unit size. While Figure 4 advances
to the next tuple in the block design after each parity stripe, the modified layout advances after
every n parity stripes, where n is the reconstruction unit size divided by the data unit size.

Note that the layout stripes data units across reconstruction units, instead of filling each
reconstruction unit with data units before switching to the next. In other words, the first tuple is
used to lay out substripe 0, the second tuple for substripe 1, and so on up to the fifth tuple for
substripe 4. At this point, the first tuple is used again to lay out substripe 5, and so on up to
substripe 9, which completes the block design table. The process repeats in the next table, and the
full block design table is constructed in the same manner as in Figure 5. Switching to the next
tuple in the block design after each substripe rather than after each parity stripe avoids excessive
clustering of consecutive user data units onto small sets of disks.


The above modification can of course be extended to pack an arbitrary number of data units
into each reconstruction unit. With this modified layout, each reconstruction unit occupies a
contiguous region on each disk, and so can be read in a single access without transferring
extraneous data.
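
One possible reading of this modified layout can be sketched as follows. This is not the paper's
implementation: it assumes the complete block design on C = 5 disks with G = 4 (every 4-disk
subset is a tuple), ignores the distinction between data and parity units, and uses hypothetical
names throughout. Each tuple's n repetitions are packed into one aligned run of n offsets on every
disk in the tuple, while consecutive substripes cycle through the tuples.

```python
from itertools import combinations


def build_layout(num_disks: int, stripe_size: int, n: int) -> dict:
    """Map (disk, offset) -> substripe index for one block design table.

    n is the reconstruction unit size divided by the data unit size.
    Assumes the complete block design (all stripe_size-subsets of the disks).
    """
    tuples = list(combinations(range(num_disks), stripe_size))
    next_slot = [0] * num_disks  # next free reconstruction-unit slot on each disk
    layout = {}
    for t, disks in enumerate(tuples):
        for d in disks:
            base = next_slot[d] * n  # first offset of this reconstruction unit
            for r in range(n):       # r-th repetition of the tuple pattern
                layout[(d, base + r)] = r * len(tuples) + t
            next_slot[d] += 1
    return layout


layout = build_layout(num_disks=5, stripe_size=4, n=2)
for offset in range(8):
    print(offset, [layout[(d, offset)] for d in range(5)])
```

In this sketch substripe s always uses tuple s mod 5, so each aligned pair of offsets on a disk
holds units belonging to a single tuple and is either wholly needed or wholly unneeded when a
given disk fails.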


Using a large reconstruction unit speeds reconstruction because disk accesses are more
efficient for large transfers than for small ones, but it lengthens user response time because large
accesses monopolize the disks for longer periods of time. To quantify this trade-off, Figure 10
plots the cumulative response time degradation during disk-oriented reconstruction versus the
declustering ratio for a 40-disk array driven to about 50% fault-free utilization using the workload
described in Table 2a. The cumulative degradation is the product of the reconstruction time and
the increase in average user response time during reconstruction over the fault-free response time.
By this "total extra wait time" metric, the increase in efficiency obtained by increasing the size of
the reconstruction unit above one track does not compensate for the elongation in response time it
causes. Figure 10 establishes that the appropriate reconstruction unit is approximately one track,
and so all the reconstruction simulations in subsequent sections use this size.
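
The metric can be written out directly. In the sketch below the function name is hypothetical and
the input values are invented purely to show how a shorter reconstruction can still score worse;
they are not simulation results.

```python
def cumulative_degradation(recon_time_s: float,
                           recon_resp_ms: float,
                           fault_free_resp_ms: float) -> float:
    """Reconstruction time times the increase in average user response time."""
    return recon_time_s * (recon_resp_ms - fault_free_resp_ms)


FAULT_FREE_RESP_MS = 30.0  # hypothetical
# Hypothetical (not measured) values for two reconstruction unit sizes.
candidates = {
    "one track": {"recon_time_s": 300.0, "recon_resp_ms": 40.0},
    "five tracks": {"recon_time_s": 200.0, "recon_resp_ms": 50.0},
}
for name, c in candidates.items():
    score = cumulative_degradation(c["recon_time_s"], c["recon_resp_ms"],
                                   FAULT_FREE_RESP_MS)
    print(f"{name}: {score:.0f} ms*s of total extra wait")
```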


Figure 10: Cumulative response time degradation during reconstruction.
CumDeg = (ReconRespTime - FaultFreeRespTime) * ReconTime


7.  Performance  evaluation 

This section examines the performance, in terms of throughput and response time, of the
declustered parity organization under three operating conditions: when the array is fault-free,
when it is in degraded mode, i.e. when a disk has failed but no replacement is available, and
during the reconstruction of a disk. Declustering is intended to improve degraded- and
reconstruction-mode performance without affecting fault-free performance. This section also
examines the implications of declustering on the reliability of the array. Declustering exposes
more disks to second failure during reconstruction, but it also makes reconstruction much faster.

In this section we will answer two specific questions. First, how does a parity declustered
array compare to an equivalent-size non-declustered array that uses the left-symmetric RAID
Level 5 layout in multiple groups of disks? In this comparison, the two systems have the same
number of disks and contain the same amount of user data. Second, once we understand when to
use declustering at all, what benefits can be obtained by reducing the value of G for a fixed
number of disks in the array? Reducing G results in less available user data space, but improves
the failure-recovery performance substantially. In this latter exploration we include the case where
G = 2, which corresponds to mirrored disks with the backup copy distributed over the array. For
completeness, we also include the case where the mirror copy of each drive resides on exactly one
other drive rather than being distributed. All the simulations that follow use the workload, array
configuration, and disk model described in Table 2.

The results show that parity declustering is a better solution to the failure-recovery problem
than the traditional approach of breaking up an array into multiple independent groups. They also
show that parity declustering can reduce reconstruction time by up to almost an order of
magnitude over RAID Level 5 for low values of the declustering ratio, while simultaneously
reducing user response time by a factor of about two.

7.1.  Comparison  to  RAID  Level  5 

One way to handle the problem of very long user response time during failure recovery in a
RAID Level 5 disk array is to stripe user data across multiple groups^8. The overall average
performance degradation experienced when a drive fails in a multi-group array is less than that of
a single group array because the load increases on only the drives in the affected group. This
means that on average only one access in N_groups experiences degraded performance, where
N_groups is the number of groups in the array.

This section compares a multi-group RAID Level 5 organization to a single-group
declustered-parity array. We keep constant the fraction of the array's capacity consumed by parity
by fixing the size of a parity stripe at 10 units. This means we compare a 4-group 9+1 RAID
Level 5 (α=1.0) to a C=40, G=10 declustered array (α=0.23). In Section 9 we revisit the
implications of larger array sizes by partitioning very large arrays into multiple groups without
varying the declustering ratio.

8. Following the terminology of Patterson, Gibson, and Katz [Patterson88], a group in a single-
failure tolerating array is a set of disks that participate in a redundancy encoding to tolerate at
most one concurrent failure. In this sense an array with parity declustered over all disks is a
single group.

Figure 11: Comparing RAID Level 5 to parity declustering: fault-free performance.

7.1.1.  No  effect  on  fault-free  performance 

Figure 11 plots the average and ninetieth-percentile user response time vs. the achieved user
I/O operations per second when the declustered parity and RAID Level 5 arrays are fault-free.
This figure shows that for OLTP workloads, declustering parity causes no fault-free performance
degradation with respect to RAID Level 5.

7.1.2.  Declustering  greatly  benefits  degraded-mode  performance 

Figure 12 plots the respective disk arrays' user response time against achieved user I/Os per
second when each array contains one failed disk, but reconstruction has not yet been started. At
low workloads the two organizations perform identically, since the extra I/Os caused by accesses
to the failed disk's data can easily be accommodated when disk utilization is low. As the workload
intensity climbs, the failure-recovery problem in RAID Level 5 arrays becomes evident: the
RAID Level 5 group containing the failure saturates at about 600 user I/Os per second (15 user
I/Os per second per disk), and forms a system-wide performance bottleneck. Because the
declustered-parity array distributes failure-induced work across all disks, it is able to deliver
about 2.'^''; more I/Os per second while still delivering a 90th percentile user response time of
about 13% over the fault-free case.

[Figure 12 plot. X-axis: User I/Os per Second, 0 to 1000. Y-axis ticks: 0 to 320.]

Figure 12: Comparing RAID Level 5 to parity declustering: degraded-mode performance.

Figure 13: Comparing RAID Level 5 to parity declustering: response time during reconstruction.

7.1.3.  Declustering  benefits  persist  during  reconstruction

Figure 13 shows average and 90th percentile user response times in reconstruction mode; that
is, while reconstruction is ongoing. In contrast to the degraded-mode performance shown in
Figure 12, Figure 13 shows that at low user workloads, parity declustered arrays deliver a slightly
worse response time in reconstruction mode. A multiple group RAID Level 5 array suffers less
penalty for reconstruction at low loads than does a parity declustered array because many disks
experience no load increase and those that do see an increase have plenty of available bandwidth.
But, because all reconstruction work is performed by only one group of a RAID Level 5
multi-group array, this group quickly becomes saturated as the on-line user load increases. Once a
group in the RAID Level 5 array is saturated, its long response times dramatically increase
average and 90th percentile response times for all user processes.

Figure 14: Comparing RAID Level 5 to parity declustering: reconstruction time.

Turning to the issue of time until reconstruction completes, Figure 14 illustrates the heart of
the failure recovery problem in RAID Level 5 arrays. Since the workload increases dramatically
on surviving disks in the group containing a failed disk, and since these are the only disks that
participate in recovering the contents of this failed disk, reconstruction time is very sensitive to
the fault-free user workload. The declustered parity organization was designed to overcome this
problem by both reducing the per-disk load increase in reconstruction and utilizing all disks in the
array to participate in this reconstruction. In other words, a RAID Level 5 array has reconstruction
bandwidth equal only to the unused bandwidth on the disks in one group, but a declustered parity
array provides the full unused bandwidth of the array to effect reconstruction.

The minimum possible reconstruction time is the time required to write the entire contents of
the replacement disk at the maximum bandwidth of the drive. The simulated 320 megabyte drives
support a maximum write rate of approximately 1.6 MB/sec, and so the minimum possible
reconstruction time is approximately 200 seconds. In Figure 14, reconstruction time in the
declustered parity organization at 600 user I/Os per second (about 50% of maximum utilization) is
approximately 260 seconds, indicating that near optimal reconstruction performance is obtained.
Contrast this with the RAID Level 5 organization, where reconstruction time is essentially
unbounded at this user access rate. To emphasize, Figure 13 and Figure 14 show response time
and reconstruction time in the same on-line reconstruction event - they show that parity
declustering provides huge savings in reconstruction time as well as savings in response time for
moderately and heavily loaded disk systems.
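
The lower bound quoted above is a single division. A minimal sketch, with a hypothetical
function name and the drive parameters taken from the text:

```python
def min_reconstruction_time_s(capacity_mb: float, write_rate_mb_per_s: float) -> float:
    """Time to write the whole replacement disk at its maximum write rate."""
    return capacity_mb / write_rate_mb_per_s


lower_bound = min_reconstruction_time_s(320.0, 1.6)
observed = 260.0  # declustered array at 600 user I/Os per second, from the text
print(f"lower bound: {lower_bound:.0f} s")
print(f"observed / lower bound: {observed / lower_bound:.2f}")
```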


7.1.4.  Declustering  also  benefits  data  reliability 

Our final figure of merit is the probability of losing data because of a disk failure occurring
while another disk is under reconstruction. Assuming that the likelihood of a disk's failure is
independent of that of each other disk, that is, assuming that there are no dependent disk failure
modes in the system, Gibson and Patterson [Gibson93] model the mean time to data loss as

$$MTTDL = \frac{MTTF_{disk}^{2}}{N_{groups} \, N_{diskspergroup} \, (N_{diskspergroup} - 1) \, MTTR_{disk}}$$

where MTTF_disk is the mean time to failure for each disk, N_groups is the number of groups in
the array, N_diskspergroup is the number of disks in one group (N_diskspergroup = G in RAID
Level 5 arrays and N_diskspergroup = C in parity-declustered arrays), and MTTR_disk is the mean
time to repair (reconstruct) a failed disk^9. From this, the probability of data loss in a time
period T due to a double disk failure condition can be modeled as

$$P(\text{data loss in time } T) = 1 - e^{-T / MTTDL}$$
Figure 15 shows the probability of losing data within 5 and 10 years (optimistic estimates of
a disk array's useful lifetime) due to a double-failure condition in each of the two organizations,
using MTTF_disk = 150,000 hours. The RAID Level 5 array is more reliable at low user access
rates because a multiple-group RAID Level 5 array can tolerate multiple simultaneous disk
failures without losing data as long as each failure occurs in a different group. In contrast, there
are no double-failure conditions that do not cause data loss in a declustered parity array. However,
as the user access rate rises, the reconstruction time, and the resulting probability of data loss,
rises much more rapidly in the RAID Level 5 array. For the example arrays and workload, the
declustered parity array becomes more reliable at about 10 user accesses per second per disk (a
fault-free utilization of about AiWr). This is significantly less than the user workload required to
saturate the RAID Level 5 array during reconstruction (about 14 accesses per second per disk).

9. Gibson and Patterson treat dependent failure modes and the effects of on-line spare disks in
depth. As nearly all of that work applies here directly, we will only describe the simple and
illustrative case of independent disk failures.

Figure 15: Comparing RAID Level 5 to parity declustering: probability of data loss within (a) 5
years and (b) 10 years. Note that the Y-axis is log-scaled.
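
The reliability comparison can be explored numerically. The sketch below assumes the commonly
cited independent-failure model MTTDL = MTTF_disk^2 / (N_groups * N_diskspergroup *
(N_diskspergroup - 1) * MTTR_disk) together with an exponential time to data loss; the function
names are hypothetical, and the repair times are placeholders rather than simulation results.

```python
import math

HOURS_PER_YEAR = 24 * 365


def mttdl_hours(mttf_disk: float, n_groups: int,
                disks_per_group: int, mttr_disk: float) -> float:
    """Mean time to data loss (hours), assuming independent disk failures."""
    return mttf_disk ** 2 / (
        n_groups * disks_per_group * (disks_per_group - 1) * mttr_disk
    )


def prob_data_loss(years: float, mttdl: float) -> float:
    """Probability of a double-failure data loss within the given period."""
    return 1.0 - math.exp(-years * HOURS_PER_YEAR / mttdl)


MTTF = 150_000.0  # hours, as in the text
# Repair times (hours) are hypothetical placeholders.
raid5 = mttdl_hours(MTTF, n_groups=4, disks_per_group=10, mttr_disk=1.0)
declustered = mttdl_hours(MTTF, n_groups=1, disks_per_group=40, mttr_disk=0.2)
for name, m in (("RAID 5, 4 groups of 10", raid5),
                ("declustered, C = 40", declustered)):
    print(f"{name}: P(loss in 10 years) = {prob_data_loss(10, m):.2e}")
```

With equal repair times the multi-group array has the larger MTTDL (360 exposed disk pairs per
unit repair time versus 1560), so under this model the declustered array becomes the more reliable
one only once its reconstruction is more than about 4.3 times faster.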

7.1.5.  Summary:  declustered  parity  allows  higher  normal  loads  in  on-line  systems

In this section we have considered the effects of replacing a multi-group RAID Level 5 array
with a declustered parity array of the same cost and the same user capacity. Essential for its
viability, declustered parity achieves the same fault-free performance as an equivalent RAID
Level 5 array. Its advantage is that it also supports higher user workloads with lower response
time in both degraded and reconstruction mode, has dramatically shorter reconstruction time, and
at moderate and high user workloads, has superior data reliability. This makes a compelling case
for the use of parity declustering in on-line systems that cannot tolerate substantial degradation
during failure recovery.


7.2.  Varying  the  declustering  ratio 

In contrast to the prior section which showed that a single group array with a declustering
ratio (α) between 0.20 and 0.25 has substantial advantages over a multi-group array with a
declustering ratio of 1.0 (RAID Level 5), this section examines the effect on failure recovery
performance of varying the declustering ratio (α) in a fixed-size single-group array. Because the
size of the array, C, is fixed, varying the declustering ratio (α = (G-1)/(C-1)) is achieved by
varying the size of each parity stripe, G. This determines the parity overhead, 1/G, and
correspondingly, the fraction of storage available to store user data, (G-1)/G. As α is decreased
from 1.0, the user data capacity of the array decreases but the failure-recovery performance
improves since the total failure-induced workload decreases. We shall show that declustering
ratios larger than 0.25, which provide low parity overhead, yield much of the performance
benefits of the example in the last section. We shall also show that in systems very sensitive to
performance during failure recovery, declustered mirroring (G = 2) is a special case with minimal
declustering ratio, high parity overhead, and failure-recovery performance advantages unavailable
in most other declustered organizations.
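
The relationships among C, G, the declustering ratio, and the capacity overhead are easy to
tabulate. The following sketch uses hypothetical function names and the 40-disk array of the
text; the particular values of G printed are chosen only for illustration.

```python
def declustering_ratio(num_disks: int, stripe_size: int) -> float:
    """alpha = (G - 1) / (C - 1)."""
    return (stripe_size - 1) / (num_disks - 1)


def parity_overhead(stripe_size: int) -> float:
    """Fraction of storage consumed by parity: 1 / G."""
    return 1 / stripe_size


def user_data_fraction(stripe_size: int) -> float:
    """Fraction of storage available for user data: (G - 1) / G."""
    return (stripe_size - 1) / stripe_size


C = 40
for G in (2, 3, 10, 20, 40):
    print(f"G={G:2d}  alpha={declustering_ratio(C, G):.2f}  "
          f"overhead={parity_overhead(G):.1%}  user={user_data_fraction(G):.1%}")
```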

We consider the same array size (40 disks), and report the performance of the arrays on the
workload described in Table 2a, using a fixed user access rate of 14 user I/Os per second per disk.
This rate was selected because it is approximately the maximum rate for this workload that the
arrays can support using a RAID Level 5 layout (α = 1.0). It causes the disks to be utilized at
slightly less than 50% in the fault-free case.

The arrays are evaluated at α = 1.0, α = 0.75, α = 0.5, α = 0.25, and two special cases G = 3
and G = 2. The case G = 3 is significant because when a parity stripe contains only two data units
and one parity unit, it is possible to improve small-write performance by replacing the normal
four-access update (data read-modify-write followed by parity read-modify-write) by a
three-access update. In this case, the controller reads the data unit that is not being updated,
computes the new parity from this unit and the unit to be written, and then writes the new data
and new parity.
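
The difference between the two update paths can be made concrete with XOR parity. The sketch
below is illustrative only: the Stripe class, its unit names, and the access counter are
hypothetical stand-ins for a controller, not part of the simulator described in the paper.

```python
def xor(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length units."""
    return bytes(x ^ y for x, y in zip(a, b))


class Stripe:
    """A G = 3 parity stripe (two data units, one parity unit) that counts accesses."""

    def __init__(self, d0: bytes, d1: bytes):
        self.units = {"d0": d0, "d1": d1, "p": xor(d0, d1)}
        self.accesses = 0

    def read(self, name: str) -> bytes:
        self.accesses += 1
        return self.units[name]

    def write(self, name: str, value: bytes) -> None:
        self.accesses += 1
        self.units[name] = value


def four_access_update(stripe: Stripe, target: str, new_data: bytes) -> None:
    """Read old data and old parity, then write new data and new parity."""
    old_data = stripe.read(target)
    old_parity = stripe.read("p")
    stripe.write(target, new_data)
    stripe.write("p", xor(xor(old_parity, old_data), new_data))


def three_access_update(stripe: Stripe, target: str, new_data: bytes) -> None:
    """Read the untouched data unit, then write new data and new parity."""
    other = "d1" if target == "d0" else "d0"
    other_data = stripe.read(other)
    stripe.write(target, new_data)
    stripe.write("p", xor(other_data, new_data))
```

Both paths leave the stripe in the same state; the second simply avoids reading the old parity.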

The case G = 2 is important because it is equivalent to disk mirroring, except that the backup
copy of each disk is distributed across the other disks in the array instead of being located on a
single drive. For comparison, the graphs also include the case where the backup copy is located
on a single drive. To distinguish between these two, we refer to the case where the backup copy is
on a single disk as "mirroring", and the case where it is declustered as the G=2 case.

Figure 16: Varying the declustering ratio: user response time in fault-free mode.

In both mirroring and parity declustering with G = 2, the four accesses associated with a
small-write operation are replaced by two: one write to each copy of the data. Another
optimization also applies: since there are two copies of every data unit, it is possible to improve
the performance of the array on read accesses by selecting the "closer" of the two copies at the
time the access is initiated [Bitton88]. The raidSim simulator contains an accurate disk model, and
so we implement this as follows: when a read access is initiated, the simulator locates the two
copies that can be read and then computes the completion time of the request for each of the two
possible accesses. This computation takes into account all components of the access time
(queueing, seeking, rotational latency, and data transfer). The simulator selects and issues the
access that will complete sooner. We refer to this as the shortest access optimization. We will see
that these optimizations can be significant for performance, but they only apply in the G=2 and
G=3 cases, which are expensive in terms of capacity overhead.
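
The selection rule can be sketched as below. This is a simplified illustration, not raidSim code:
the AccessEstimate type and its fields are hypothetical, and the per-component times are assumed
to be supplied by some disk model.

```python
from dataclasses import dataclass


@dataclass
class AccessEstimate:
    """Estimated cost of reading one copy of a data unit (all times in ms)."""
    disk: int
    queueing_ms: float
    seek_ms: float
    rotational_ms: float
    transfer_ms: float

    def completion_ms(self) -> float:
        """Sum of all components of the access time."""
        return (self.queueing_ms + self.seek_ms
                + self.rotational_ms + self.transfer_ms)


def shortest_access(a: AccessEstimate, b: AccessEstimate) -> AccessEstimate:
    """Pick whichever copy is expected to complete sooner (ties go to a)."""
    return a if a.completion_ms() <= b.completion_ms() else b


# Hypothetical estimates: the primary is queued behind another request,
# while the backup copy is idle but needs a longer seek.
primary = AccessEstimate(disk=3, queueing_ms=4.0, seek_ms=8.0,
                         rotational_ms=5.0, transfer_ms=2.0)
backup = AccessEstimate(disk=17, queueing_ms=0.0, seek_ms=12.0,
                        rotational_ms=6.0, transfer_ms=2.0)
print("issue read to disk", shortest_access(primary, backup).disk)
```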

7.2.1.  Fault-free  performance:  benefits  of  high  overhead  optimizations 

Figure 16 shows that the response time performance of a fault-free array is independent of α
in all cases except G=2 and G=3, where the above-described optimizations can be applied. This
figure confirms the result of Section 7.1.1 that declustering parity does not negatively affect
fault-free performance. Similarly, declustered parity with G=2 performs essentially identically to
mirroring. Figure 16 does not show a large benefit for the three-access update when G=3 because
the OLTP workload used is dominated by read rather than write operations. However, for G=2,
the combined response-time benefit of a two-access update and the shortest access optimization is
close to a savings of 40% for average response time, and a savings of 20% for 90th-percentile
response time. Thus for workloads such as OLTP that are dominated by small accesses, the main
consideration for fault-free performance is whether or not the value of the optimizations available
in the G=2 case warrants the large capacity overhead it incurs.

Figure 17: Varying the declustering ratio: user response time in degraded mode.

7.2.2.  Degraded-mode  and  reconstruction-mode  performance:  declustering  at  its  best 

Figure 17 demonstrates the declustering ratio's direct effect on degraded-mode performance
of an array. As the declustering ratio, α, ranges down from 1.0 the array's response time decreases
almost linearly to a minimum that is about half of its maximum (at α=1.0). Comparing Figure 17
to Figure 16, the minimum degraded-mode response times that occur with small declustering
ratios are little degraded from their fault-free counterparts. This lack of degradation at low α
occurs because reconstructing data on-the-fly is adding very little to each surviving disk's
utilization. However, when α=1.0 the degraded-mode utilization is close to 100% because this
read-intensive user workload induces a fault-free utilization of slightly less than 50%. Hence,
response time is dramatically longer when degraded than when fault-free.

Figure 18: Varying the declustering ratio: reconstruction time.

User response time during reconstruction shows essentially the same characteristics as user
response time in degraded mode because user accesses are given strict priority over
reconstruction accesses, and so reconstruction is just a little more load on each surviving disk.
However, Figure 18 shows that reconstruction time decreases by an order of magnitude as α drops
from 1.0 to 0.2. The shape of this curve is determined by the interaction of two separate
bottlenecks: at high α the rate at which data can be read from surviving disks limits
reconstruction rate, but at low α the replacement disk is the bottleneck^10. Since a high
declustering ratio causes surviving disks to be saturated with work, reconstruction time falls off
steeply with decreasing α, flattening out at the point where the replacement disk becomes
saturated with reconstruction writes.

Finally, reconstruction time is much longer for mirroring than for declustered parity with
G=2 because a declustered array has the aggregate unused bandwidth of the entire array available
to read blocks of the backup copy, while a mirrored array has only the bandwidth of a single disk.
The reconstruction time is not as long as in the case of α=1.0 (RAID Level 5) because mirroring
handles user accesses more efficiently.

10. If the array has on-line spare disks, this bottleneck may be eliminated, allowing
reconstruction time to be further reduced, by distributing the capacity of spare disks throughout
the array [Menon92b, Ng92].

Figure 19: Varying the declustering ratio: probability of data loss within (a) 5 years and (b) 10
years. Note that the Y-axis is log-scaled.

7.2.3.  High  data  reliability:  another  advantage  for  declustered  parity

Figure 19 shows the probability of losing data within 5 and 10 years due to a disk failure
occurring while the reconstruction of another disk is ongoing (refer to Section 7.1.4). Decreasing
reconstruction time by decreasing the declustering ratio in an array directly decreases the
probability of data loss in any time period. This figure, then, is largely determined by the data in
Figure 18, except that the mirroring case has substantially lower probability of data loss over the
given time periods. This is because the mirrored configuration can tolerate many simultaneous
disk failures, so long as each failure occurs in a distinct mirror pair. In the declustering cases,
including G=2, the simultaneous failure of any two disks in the array results in data loss.

7.2.4.  Summary 

In contrast to parity declustered arrays with fixed declustering ratios determined by a fair-cost
comparison to multi-group RAID Level 5 arrays in Section 7.1, this section examined the choices
available if an array's declustering ratio is varied. We found that declustered mirroring (the "G=2"
case), although expensive in terms of capacity overhead, offers special benefits over declustered
parity layouts with slightly higher declustering ratios. Alternatively, if lowering cost or overhead
is of prime concern, then a declustering ratio of 0.5 is of particular interest. It provides half the
benefit for improving degraded- and reconstruction-mode performance and nearly all the benefit
for reducing reconstruction time and improving data reliability while costing only twice the parity
overhead of a single group RAID Level 5 array.

8.  Work  reducing  variations  to  reconstruction  algorithms 

Muntz and Lui [Muntz90] identified two simple modifications to a reconstruction algorithm,
each intended to improve reconstruction-mode performance or reduce reconstruction time by
reducing the total work required of surviving disks. In the first, called redirection of reads, user
read requests for data units residing on a failed disk that have already been reconstructed are
serviced by the replacement disk instead of invoking on-the-fly reconstruction as is done in
degraded mode. This reduces the number of disk accesses needed to service the read from G-1
to 1. Although this seems to be an obvious thing to do, we shall see that it can lengthen
reconstruction time. In the second modification, called piggybacking of writes, when a user read
request causes a data unit to be reconstructed on-the-fly, that data unit is written to the
replacement drive as well as being delivered to the requesting process. This is intended to speed
reconstruction by reducing the total number of data units that need to be recovered, but in the
following evaluation it will turn out to have little effect.

Additionally, there are two ways to service a user write to a data unit whose contents have not
yet been reconstructed. In the first, the new data is written directly to the replacement drive, and
the parity is updated to reflect this change. In the second, only the parity is updated; the data is
not written to disk at all. Figure 20 illustrates the two approaches: in the first method the new
data is written to the replacement disk, and the parity is updated by reading all the other units in
the parity stripe, XORing them together with the new data, and writing the result to the parity
unit. In the second method, the parity is updated in the same manner as the first option, but the
new data is not written to the replacement drive. In the latter case, the data unit being updated
remains invalid until recovered by the background reconstruction process. The difference between
these two approaches is that the former writes the replacement disk while the latter does not. We
view sending user writes to the replacement disk (the former approach) as a third modification
that can be applied, and refer to it as the user writes option.
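
The parity update shared by both methods can be sketched as follows. This is a minimal illustration rather than the implementation evaluated in this paper: stripe units are modeled as equal-length byte strings, and the names xor_units, write_unreconstructed, and the write_to_replacement flag are hypothetical.

```python
from functools import reduce


def xor_units(units):
    """Bytewise XOR of equal-length stripe units."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*units))


def write_unreconstructed(surviving_data, new_data, write_to_replacement):
    """Service a user write to a data unit that is not yet reconstructed.

    surviving_data: the other data units of the parity stripe, read from the
                    surviving disks (the old parity is not needed).
    Returns (new_parity, replacement_contents). replacement_contents is None
    when the unit is left invalid for the background reconstruction process.
    """
    # New parity = XOR of all other data units and the new data.
    new_parity = xor_units(list(surviving_data) + [new_data])
    # Method 1 also installs the data on the replacement; Method 2 does not.
    replacement = new_data if write_to_replacement else None
    return new_parity, replacement
```

Under the second method the new data is not lost: XORing the surviving data units with the updated parity yields exactly the data that was written, which is what the background reconstruction process later installs on the replacement drive.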

Figure 20: Two methods for servicing a user write to as-yet unreconstructed data.

Method 1 writes the new data to the replacement and updates the parity. Method 2 updates
only the parity, and allows the background reconstruction process to later install the new data
on the replacement drive.

These three options affect the distribution of work between surviving disks and the replacement
disk. When all three options are off, the replacement disk sees only reconstruction writes
and user writes to data that has been previously reconstructed, while the remainder of the
workload is serviced by surviving drives. Enabling an option shifts workload from the surviving
disks to the replacement disk: redirecting reads shifts user-read workload, piggybacking writes
shifts reconstruction workload, and enabling user writes to the replacement shifts user-write
workload.
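
The way each option moves work onto the replacement disk can be summarized in a short sketch. It only restates the rules in the paragraph above; the Options class and the replacement_disk_work function are hypothetical names introduced for illustration, and background reconstruction writes (which always go to the replacement disk) are not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Options:
    redirect_reads: bool = False     # R: redirection of reads
    piggyback_writes: bool = False   # P: piggybacking of writes
    user_writes: bool = False        # W: user writes to the replacement


def replacement_disk_work(op, reconstructed, opts):
    """Return the accesses the replacement disk sees for one user access
    that targets a data unit of the failed disk.

    op: "read" or "write". reconstructed: whether the unit has already
    been rebuilt on the replacement disk.
    """
    if op == "write":
        if reconstructed:
            return ["write"]          # always sent to the replacement
        return ["write"] if opts.user_writes else []
    if op == "read":
        if reconstructed:
            return ["read"] if opts.redirect_reads else []
        # Not yet reconstructed: the data is rebuilt on the fly from the
        # surviving disks, and optionally saved on the replacement.
        return ["write"] if opts.piggyback_writes else []
    raise ValueError(f"unknown operation: {op!r}")
```

With all options off, only writes to previously reconstructed data return a non-empty list, matching the baseline described above.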

In a previous paper [Holland92] we analyzed the performance of these options using the
stripe-oriented reconstruction algorithm, a 50% write workload, and small striping units (4 KB).
This section revises this analysis using the disk-oriented reconstruction algorithm, the more
realistic and less write-intensive workload described in Table 2a, and track-sized stripe units.
Larger stripe units have been recommended for varied workloads because they reduce the
probability that small requests require service from multiple disk arms while still allowing
parallel transfer for requests large enough to benefit substantially [Chen90b]. The prior study
showed that the piggybacking and user-writes options had a measurable but not very significant
effect on reconstruction time. Because of the lower write fraction and the larger reconstruction
unit in the new study, these effects have essentially disappeared, and so we find that redirection
of reads is the only option that significantly influences failure-mode performance. As expected,
the effects of redirection are more pronounced in the new study because of the read-dominated
workload.


In the following we show at most five of the possible eight combinations of these three
reconstruction algorithm options: all options off, each option on with the other two off, and all
options on. As we shall see, only one option, the redirection of reads option, is effective for the
workload of Table 2.

Figure 21: User response time for five combinations of reconstruction options.

In the legend, R indicates redirection of reads, P indicates piggybacking of writes, W indicates
user-writes to the replacement drive, 0 indicates that an option is off, and 1 indicates that an
option is on. The figure is difficult to read because of the overlapping curves: in all plots, the
000, 010, and 001 curves are essentially coincident, as are the 100 and 111 curves.

8.1.  The  effects  of  the  reconstruction  options 

Figure 21 shows the average and 90th percentile user response time during reconstruction for
five combinations of the reconstruction options. This figure shows that the piggybacking of writes
and user-writes options have little effect on user response time. To understand this, first note that
updating a particular unit on the replacement drive can improve response time only if that unit is
re-accessed prior to the completion of reconstruction. However, for a random workload, the
probability of re-accessing the same data unit before reconstruction completes is fairly small,
and so these two reconstruction options have little effect.
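
A rough feel for why this probability is small can be had from a simple model. The sketch below assumes accesses are independent and uniformly distributed over the units of the failed disk, which is an idealization and not the simulated workload of Table 2; the function name and the example numbers are hypothetical.

```python
def reaccess_probability(units_on_disk, later_accesses):
    """Probability that one given unit is touched again by at least one of
    `later_accesses` accesses, each uniformly random over `units_on_disk`."""
    if units_on_disk <= 0 or later_accesses < 0:
        raise ValueError("need units_on_disk > 0 and later_accesses >= 0")
    return 1.0 - (1.0 - 1.0 / units_on_disk) ** later_accesses


# Hypothetical example: 100,000 units on the failed disk and 5,000 further
# accesses to that disk before reconstruction completes.
p = reaccess_probability(100_000, 5_000)
```

With these illustrative numbers the probability is a little under 5%, so only a small fraction of the units installed early on the replacement disk would ever be read again before the background process would have recovered them anyway.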


Redirection of reads, in contrast to the other options, can be effective for the OLTP workload.
It improves user response time by as much as 10% to 20% when the declustering ratio is near 1.0,
with its benefit diminishing to zero as this ratio decreases. It is most effective when this ratio is
large because the surviving disks are heavily loaded by reconstruction. Off-loading work from
these drives by redirecting reads to the underutilized replacement disk improves response time
both by reducing the number of I/Os necessary to service a user read and by servicing such a read
on a lightly-utilized drive. As α is reduced, however, both these effects diminish; it takes fewer
disk reads to service a user read to the failed drive, and the replacement disk utilization
increases because these more lightly loaded surviving disks reconstruct units more quickly.

Figure 22: Reconstruction time for five combinations of reconstruction options.

Refer to Figure 21 for a description of the legend.

Figure 22 shows the reconstruction time for five combinations of options. The piggybacking
of writes and user-writes options again make little difference. In this case, it is because the
workload is dominated by accesses that are smaller than one reconstruction unit. When a user- or
piggybacked-write operation occurs on the replacement disk, only a fraction of a reconstruction
unit is updated and marked as reconstructed. When a reconstruction process examines this unit to
decide if it needs to be reconstructed, it will find that some portion of the unit is still
unrecovered. The reconstruction process then has the option of reconstructing only the
unrecovered portion of the unit, or of reconstructing the entire unit. Because there is little
difference between the time taken to read an entire track and the time taken to read a track less
one unit, and because many disks cannot read two blocks on one track as quickly as they read the
whole track, our implementation always chooses the latter option. Hence, most of the potential
benefits to reconstruction time from user- and piggybacked-write options are lost, since these
writes do not update entire reconstruction units. Moreover, at low α, these two options actually
have a negative effect on reconstruction time since they cause more work to be sent to the
over-utilized replacement disk.
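
The policy of rebuilding an entire reconstruction unit whenever any part of it is unrecovered can be illustrated with a small sketch. The bookkeeping shown here (a set of already-valid blocks and the function name units_to_rebuild) is an assumption made for illustration, not a description of our simulator's data structures.

```python
def units_to_rebuild(blocks_per_unit, num_units, written_blocks):
    """Return the reconstruction units the background process must rebuild.

    written_blocks: set of (unit, block) pairs already installed on the
    replacement disk by user writes or piggybacked writes. A unit is skipped
    only if every one of its blocks is valid; otherwise the whole unit is
    rebuilt, as in the policy described in the text.
    """
    rebuild = []
    for unit in range(num_units):
        valid = sum((unit, b) in written_blocks for b in range(blocks_per_unit))
        if valid < blocks_per_unit:
            rebuild.append(unit)
    return rebuild
```

For example, with four blocks per unit, a single small write into unit 0 leaves that unit on the rebuild list, while unit 1 is skipped only once all four of its blocks have been written.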

While redirection of reads reduces user response time during recovery at all values of α, it
does not have the same effect on reconstruction time. Figure 22 shows that enabling this option
halves reconstruction time at α=1.0, but doubles it at α=0.1. This is partly because the
replacement disk is over-utilized at low α, but there is also another reason. In the absence of
user workload, the replacement disk services only writes from the reconstruction process and
writes to previously-reconstructed data. Because the reconstruction writes are purely sequential,
the replacement drive experiences a very low average positioning overhead, and operates at high
efficiency. When any of the reconstruction options are enabled, the replacement disk incurs a
significant reduction in its efficiency because it must service far more randomly located
accesses. This accounts for the significant increase in reconstruction time at low α when the
reconstruction options are enabled.

8.2.  Dynamic  use  of  reconstruction  options 

As Figure 22 shows, the value of a reconstruction algorithm option depends on which part of
the array, replacement or surviving disk, is limiting the rate of reconstruction. In addition to
being dependent on an array's declustering ratio, this effect is dependent on the amount of the
failed disk's data so far reconstructed. Recognizing this dependence, Muntz and Lui suggested
that the reconstruction algorithm should monitor disk utilizations and enable or disable each
option dynamically, depending on whether surviving disks or the replacement disk constitutes a
bottleneck.

Figure 23: Evaluating monitored redirection of reads: response time. (Horizontal axis:
declustering ratio α.)

Figure 24: Evaluating monitored redirection of reads: reconstruction time. (Horizontal axis:
declustering ratio α.)

Figure 23 and Figure 24 show, respectively, user response time during reconstruction and
reconstruction time using a monitored application of redirection of reads instead of a constant
(always enabled) application or no (always disabled) application. We have chosen to dynamically
apply only the redirection of reads option because it is the only option that significantly affects
recovery mode performance for the OLTP workload. We refer to this dynamic reconstruction
algorithm as the monitored redirection option. We employ a simple monitoring scheme: the
duration of disk busy and idle periods is recorded, and every 300 accesses a new estimate for the
utilization of each disk is generated. If the replacement disk utilization is higher than the average
surviving disk utilization, the replacement is declared the bottleneck, and redirection of reads is
disabled until the next time the estimates are updated. If the opposite is true, the surviving disks
are declared the bottleneck, and redirection of reads is enabled until the next utilization estimate
update.
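
The monitoring scheme just described can be sketched as a small controller. The class below is an illustration under stated assumptions, not the code used in our simulator: the class and method names are hypothetical, the update interval is counted over all disks together, redirection is assumed to start enabled, and a tie between the two utilizations is resolved in favor of enabling redirection.

```python
class RedirectionMonitor:
    """Enable or disable redirection of reads from measured disk utilizations."""

    def __init__(self, num_disks, replacement, window=300):
        self.busy = [0.0] * num_disks   # accumulated busy time per disk
        self.idle = [0.0] * num_disks   # accumulated idle time per disk
        self.replacement = replacement  # index of the replacement disk
        self.window = window            # accesses between re-estimates (assumed total)
        self.accesses = 0
        self.redirect_reads = True      # assumed initial setting

    def record(self, disk, busy_time, idle_time):
        """Log one access: the idle period before it and its busy period."""
        self.busy[disk] += busy_time
        self.idle[disk] += idle_time
        self.accesses += 1
        if self.accesses >= self.window:
            self._update()

    def _update(self):
        util = []
        for b, i in zip(self.busy, self.idle):
            util.append(b / (b + i) if b + i > 0 else 0.0)
        survivors = [u for d, u in enumerate(util) if d != self.replacement]
        avg_surviving = sum(survivors) / len(survivors)
        # Replacement busier than the average survivor: it is the bottleneck,
        # so stop sending it redirected reads until the next estimate.
        self.redirect_reads = util[self.replacement] <= avg_surviving
        # Start a fresh measurement window.
        self.busy = [0.0] * len(self.busy)
        self.idle = [0.0] * len(self.idle)
        self.accesses = 0
```

A caller would consult redirect_reads before servicing each user read to already-reconstructed data; the flag stays fixed between estimate updates, as in the scheme above.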

As Figure 23 shows, the response-time performance of monitored redirection is actually
worse at moderate and low declustering ratios than the constant-redirection case because
redirection of reads, uniformly beneficial to response time when enabled, is largely disabled.
Figure 24, however, shows that reconstruction time is minimized because the reconstruction rate
is at all times limited by whichever disks are the reconstruction bottleneck.

To summarize, for the OLTP workload, the only effective work-reducing variation to the
disk-oriented reconstruction algorithm is the redirection of reads. This option improves user
response time by as much as 10% to 20% when the declustering ratio is large while reducing
reconstruction time by as much as half. However, at a low declustering ratio, redirection of reads
benefits response time by only a very small amount, and lengthens reconstruction time by
over-utilizing the replacement disk. A dynamic application of this option based on monitoring
disk utilizations achieves much of its benefits without its costs, independent of the declustering
ratio.

9.  Array  configuration:  single  versus  multiple  groups  revisited 

Section 7.1 shows that for arrays of up to about 40 disks, a single declustered group
organization yields better failure-mode performance than an organization that separates disks
into a set of independent RAID Level 5 groups. In this section we revisit the question of when to
configure a set of disks as a single group or multiple groups, where the data reliability of each
group is independent of failures in other groups. In particular, we are interested in how to
configure arrays that have more than 40 disks. In this context an array configuration is a set of
values for the number of disks in a group, C, the number of units in one parity stripe, G, and the
number of groups. We shall see that it is not always desirable, and sometimes not viable, to
structure a large array as a single declustered group.

A primary consideration in the construction of large single-group arrays is their susceptibility
to data loss arising from failures in equipment other than the disks [Gibson93]. For example, if
the bus-connected disk array architecture shown in Figure 1a provides only one path to each disk
but shares this path over multiple disks, the failure of a path renders multiple disks unavailable,
although not damaged, for long periods of time. We say that such a path failure constitutes a
dependent failure mode for the set of disks on that path. To make such an array tolerant of all
single failures according to criterion one in Section 4.1, these disks may not reside in the same
redundancy group. A cost-effective way to do this is to organize each rank of drives as an
independent parity group. It follows then that the size of each declustered group (C) can be no
larger than the number of cable paths in the array. With today's technology, board area and cable
connector size limit the number of paths operating in a single array to a relatively small number,
usually much less than 40. In this case, layouts based on block designs and the results of
Section 7.1 are directly applicable.

In disk arrays with sufficient redundancy in non-disk components, such as the fully
duplicated versions of Figure 1, the number of disks managed as a single parity group could be
much larger than 40. In the process of configuring such large arrays, the fundamental trade-off is
between cost, data reliability, parity overhead, fault-free performance, and on-line failure
recovery performance. Remaining with the OLTP-like model of such an array's workload, we
assume that the goal of a configuration is to achieve the lowest cost array which meets specific
I/O throughput and response time requirements and that component disk capacity can be
manipulated to meet data capacity targets. In particular, to maximize throughput for a target
number of disks, we seek fault-free disk utilizations as high as possible while ensuring that
response time requirements are met during on-line reconstruction. The most effective method of
doing this is to minimize the increase in disk utilization during on-line reconstruction, which can
be scaled by the declustering ratio, α = (G-1)/(C-1), because this directly influences the increase
in load on surviving disks during on-the-fly reconstruction in degraded mode. Left to be
determined are the size of each group in the array, C, and the number of these groups. We now
turn to the impact of these two parameters on data reliability and parity overhead.

The data reliability equations in Section 7.1.4 show that mean time until data loss is inversely
proportional to group size (C) and to failure recovery time (MTTR), for a fixed array size. But
given a fault-free user workload and a declustering ratio, failure recovery time is largely a
function of a single disk's capacity and performance, as shown in Figure 14 and Figure 15. This
implies that data reliability increases with decreasing group size (which means increasing the
number of groups). However, with a fixed declustering ratio, decreasing the group size reduces the
parity stripe size, G, which increases the parity overhead of the array, 1/G. Increasing parity
overhead, in turn, increases the amount of storage space each disk must provide, increasing
overall array cost. This is the final trade-off: data reliability against cost.
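
The trade-off can be made concrete with a short calculation. The parity stripe size follows from the definition α = (G-1)/(C-1) and the parity overhead is 1/G, both as stated above. The data loss estimate, however, is an assumption: it uses the commonly quoted single-failure-tolerant approximation MTTDL = MTTF^2 / (N (C-1) MTTR) with exponentially distributed time to data loss, which may differ in detail from the equations of Section 7.1.4. All function names and the example disk parameters are hypothetical.

```python
import math


def parity_stripe_size(group_size, alpha):
    """Solve alpha = (G - 1) / (C - 1) for G, rounded to the nearest integer."""
    return round(alpha * (group_size - 1)) + 1


def parity_overhead(group_size, alpha):
    """Fraction of array capacity holding parity, 1/G."""
    return 1.0 / parity_stripe_size(group_size, alpha)


def data_loss_probability(total_disks, group_size, mttf_hours, mttr_hours, years):
    """ASSUMED model: MTTDL = MTTF^2 / (N * (C - 1) * MTTR), where N is the
    total number of disks and C the group size; time to data loss is taken
    to be exponentially distributed with mean MTTDL."""
    mttdl = mttf_hours ** 2 / (total_disks * (group_size - 1) * mttr_hours)
    return 1.0 - math.exp(-years * 8760.0 / mttdl)


# Hypothetical comparison: 400 disks as one group versus ten groups of 40,
# with illustrative (not measured) MTTF and MTTR values.
single = data_loss_probability(400, 400, 150_000.0, 1.0, 10)
split = data_loss_probability(400, 40, 150_000.0, 1.0, 10)
```

In this sketch, partitioning lowers the probability of data loss (split is smaller than single) but, at a fixed α, the smaller group has a smaller G and therefore a larger overhead 1/G, which is the trade-off described above.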


Figure 25 quantifies this reliability versus overhead trade-off for various array sizes, using
α = 0.25 and α = 0.5, the IBM Lightning drives described in Table 2, the reliability model in
Section 7.1.4, and reconstruction times given in Figure 15. In general, reconstruction time may be
estimated by simulation, as in this paper, or by using an analytical model such as that of Merchant
and Yu [Merchant92] or Muntz and Lui [Muntz90].

Figure 25 shows that the large arrays considered (400 disks and larger) will have up to a 30%
chance of losing data within 10 years when configured as a single group. Where this is too large a
risk, the array must be partitioned into multiple independent groups. When this is done, data
reliability can be increased by an order of magnitude while parity overhead remains beneath 20%
when α=0.25, and beneath 10% when α=0.5.

This figure also allows us to revisit the question presented in Section 4.4. In that section we
discussed selecting between a declustered parity layout based on balanced incomplete block
designs or based on random permutations. Pessimistically, if a declustered parity group size
exceeds 40 we cannot guarantee a small block design for an arbitrary declustering ratio; for such
a guarantee, Merchant and Yu's random permutations layout can be used. In terms of Figure 25,
points in the lower right of the data loss probability charts correspond to multiple group
configurations where individual groups are not larger than 40 disks. If block designs are used,
this figure also shows that the parity overhead can be as low as 10% when α = 0.25, or 5% when
α = 0.5.

10.  Conclusions 

Redundant disk arrays, developed to ensure that lost data can be recovered quickly, have the
ability to provide on-line service during failure recovery, but often with dismal performance. For
example, the 80% read workload characteristic of OLTP, serviced by a 40-disk RAID Level 5
array, increases in intensity by about 60% during on-line failure recovery, so fault-free
utilization must be less than about 60% if response time during recovery is to meet any realistic
target. In this paper we evaluated two types of techniques for managing the performance of a
redundant disk array during on-line failure recovery. First, we examined how the organization of
data and parity in the array determines the amount of work that must be done to recover the
contents of a failed disk. Second, we explored alternative strategies for executing this recovery
with particular interest in the trade-off between cost, failure recovery time, and performance
during recovery.
Figure 25: The trade-off between reliability and capacity for (a) α = 0.25 and (b) α = 0.50.

The data loss probabilities are plotted on a log scale, while the capacity overhead scale is linear.

The most common disk array organization used for controlling data reliability and on-line
failure recovery performance is based on dividing the array into multiple independent groups. In
this case most accesses will not suffer any degradation during on-line failure recovery.
Unfortunately, if a RAID Level 5 organization is used in each group, some accesses may
experience a large degradation in performance. In contrast, a parity declustering organization for
the full size array distributes recovery work over all disks, lightly degrading the performance of
all accesses. For the arrays we investigated, a parity declustering organization supports, before
saturating, a 40-50% higher user workload than a cost-equivalent multiple-group RAID Level 5
array. When we considered only a single group and varied the amount of the array's capacity
sacrificed for redundant information, we found that increased declustering of parity can reduce
average and 90th percentile user response time by a factor of two in both degraded mode and
reconstruction mode, and can reduce reconstruction time by up to an order of magnitude. Parity
declustering, then, provides a powerful and flexible mechanism for balancing cost, failure
recovery performance, and reconstruction time.

For either organization of data and parity in an array, a second important technique for
improving the failure-mode performance is to tune the reconstruction algorithm. We presented a
disk-oriented reconstruction algorithm, and demonstrated that it yields up to 40% faster
reconstruction than the more common stripe-oriented approach, while maintaining similar user
responsiveness. We also investigated the benefits and drawbacks of three modifications to the
reconstruction algorithm, concluding that for read-dominated workloads such as have been
observed in OLTP traces, the only option that has significant impact on failure-mode performance
is whether user reads to previously-reconstructed data are serviced by the replacement disk or
by the surviving disks (the redirection of reads option). Since the benefit of redirection is
configuration-dependent, we analyzed a proposed technique for optimally controlling its
application based on observed disk utilizations. We concluded that the strategy does yield optimal
reconstruction time, but that the simpler strategy of applying redirection for all applicable
accesses allows the system to achieve about 10% better user response time for certain
configurations.

In the final section of the paper we discussed trade-offs involved in determining the
configuration of large arrays, returning to the question of when it is necessary to partition large
arrays into multiple independent groups to achieve acceptable data reliability. We found that, in
very large arrays, parity declustering and partitioning can increase data reliability by an order of
magnitude while maintaining good on-line failure recovery performance and requiring a capacity
overhead for parity in the range of 5-20%.
There remain several areas to explore in the topic of failure recovery. First, because
parity-based redundant disk arrays exhibit small-write performance that is up to a factor of four
worse than non-redundant arrays, and a factor of two worse than mirrored arrays, it is highly
desirable to combine parity declustering with parity logging [Stodolsky93] or log-structured file
systems [Rosenblum91], both techniques for improving this small-write performance in disk
arrays. Second, the block-design based layout could be made much more general by relaxing the
requirements on the tuples used for layout. For example, it might be possible to derive a balanced
layout from a packing or covering [Mills92] instead of an actual block design, or a layout might
be derived from a design in which the number of objects per tuple is not constant. Each of these
approaches would expand the range of configurations that can be implemented using the
block-design-based layout presented in this paper. Finally, implementing distributed sparing
[Menon92b] in a declustered array could eliminate the replacement disk as a reconstruction
bottleneck for low values of the declustering ratio (α), and perhaps yield extremely fast
reconstruction.